Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
You are correct, I understand. My only concern is the backward-compatibility
problem: this worked in previous versions of Apache Spark. It's
painful when an OOTB feature breaks without documentation or a workaround
like "spark.sql.legacy.keepSqlRecursive" true/false. It's not about "my
code", it is about all the production code running out there.

Thank you so much

On Mon, Dec 13, 2021 at 2:32 PM Sean Owen  wrote:

> I think we're going around in circles - you should not do this. You essentially
> have "__TABLE__ = SELECT * FROM __TABLE__" and I hope it's clear why that
> can't work in general.
> At first execution, sure, maybe "old" __TABLE__ refers to "SELECT 1", but
> what about the second time? If you stick to that interpretation, it's
> actually not executing correctly, though it 'works'. If you execute it as is,
> it fails due to circularity. Both are bad, so it's just disallowed.
> Just fix your code?
>
> On Mon, Dec 13, 2021 at 11:27 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> I've reduced the code to reproduce the issue,
>>
>> val df = spark.sql("SELECT 1")
>> df.createOrReplaceTempView("__TABLE__")
>> spark.sql("SELECT * FROM __TABLE__").show
>> val df2 = spark.sql("SELECT *,2 FROM __TABLE__")
>> df2.createOrReplaceTempView("__TABLE__") // Exception in Spark 3.2 but
>> works for Spark 2.4.x and Spark 3.1.x
>> spark.sql("SELECT * FROM __TABLE__").show
>>
>> org.apache.spark.sql.AnalysisException: Recursive view `__TABLE__`
>> detected (cycle: `__TABLE__` -> `__TABLE__`)
>>   at
>> org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
>>   at
>> org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
>>   at
>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
>>   at
>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
>>
>> On Mon, Dec 13, 2021 at 2:10 PM Sean Owen  wrote:
>>
>>> _shrug_ I think this is a bug fix, unless I am missing something here.
>>> You shouldn't just use __TABLE__ for everything, and I'm not seeing a good
>>> reason to do that other than it's what you do now.
>>> I'm not clear if it's coming across that this _can't_ work in the
>>> general case.
>>>
>>> On Mon, Dec 13, 2021 at 11:03 AM Daniel de Oliveira Mantovani <
>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>
>>>>
>>>> In this context, I don't want to worry about the name of the temporary
>>>> table. That's why it is "__TABLE__".
>>>> The point is that this behavior in Spark 3.2.x breaks backward
>>>> compatibility with all previous versions of Apache Spark. In my opinion we
>>>> should at least have some flag like "spark.sql.legacy.keepSqlRecursive"
>>>> true/false.
>>>>
>>>
>>
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
I've reduced the code to reproduce the issue,

val df = spark.sql("SELECT 1")
df.createOrReplaceTempView("__TABLE__")
spark.sql("SELECT * FROM __TABLE__").show
val df2 = spark.sql("SELECT *,2 FROM __TABLE__")
df2.createOrReplaceTempView("__TABLE__") // Exception in Spark 3.2 but works for Spark 2.4.x and Spark 3.1.x
spark.sql("SELECT * FROM __TABLE__").show

org.apache.spark.sql.AnalysisException: Recursive view `__TABLE__` detected
(cycle: `__TABLE__` -> `__TABLE__`)
  at
org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
  at
org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
  at
org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
  at
org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
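
A minimal sketch of one way around this in Spark 3.2 (the unique-name scheme below is
illustrative, not part of Almaren or of this thread): give every intermediate temporary
view its own name, so no view definition ever references itself.

import java.util.UUID
import org.apache.spark.sql.DataFrame

// Register a DataFrame under a fresh, unique temp view name and return that name.
def registerView(df: DataFrame): String = {
  val name = "tbl_" + UUID.randomUUID().toString.replace("-", "")
  df.createOrReplaceTempView(name)
  name
}

val v1  = registerView(spark.sql("SELECT 1"))
val df2 = spark.sql(s"SELECT *, 2 FROM $v1")
val v2  = registerView(df2)                  // no cycle: the new view never references itself
spark.sql(s"SELECT * FROM $v2").show()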

On Mon, Dec 13, 2021 at 2:10 PM Sean Owen  wrote:

> _shrug_ I think this is a bug fix, unless I am missing something here. You
> shouldn't just use __TABLE__ for everything, and I'm not seeing a good
> reason to do that other than it's what you do now.
> I'm not clear if it's coming across that this _can't_ work in the general
> case.
>
> On Mon, Dec 13, 2021 at 11:03 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>>
>> In this context, I don't want to worry about the name of the temporary
>> table. That's why it is "__TABLE__".
>> The point is that this behavior for Spark 3.2.x it's breaking back
>> compatibility for all previous versions of Apache Spark. In my opinion we
>> should at least have some flag like "spark.sql.legacy.keepSqlRecursive"
>> true/false.
>>
>

-- 

--
Daniel Mantovani


Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
In this context, I don't want to worry about the name of the temporary
table. That's why it is "__TABLE__".
The point is that this behavior in Spark 3.2.x breaks backward
compatibility with all previous versions of Apache Spark. In my opinion we
should at least have some flag like "spark.sql.legacy.keepSqlRecursive"
true/false.

On Mon, Dec 13, 2021 at 1:47 PM Sean Owen  wrote:

> You can replace temp views. Again: what you can't do here is define a temp
> view in terms of itself. If you are reusing the same name over and over,
> it's probably easy to do that, so you don't want to do that. You want
> different names for different temp views, or else ensure you aren't doing
> the kind of thing shown in the SO post. You get the problem, right?
>
> On Mon, Dec 13, 2021 at 10:43 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> I didn't post the SO issue; I just found the same exception I'm facing
>> with Spark 3.2. The Almaren Framework has a concept of creating temporary
>> views with the name "__TABLE__".
>>
>> For example, if you want to use the SQL dialect on a DataFrame to join a
>> table, aggregate, or apply a function, instead of creating a named
>> temporary table you just use the "__TABLE__" alias. You don't really care
>> about the name of the table. You may use this "__TABLE__" approach in
>> different parts of your code.
>>
>> Why can't I create or replace temporary views on different DataFrames with
>> the same name as before ?
>>
>>
>>
>> On Mon, Dec 13, 2021 at 1:27 PM Sean Owen  wrote:
>>
>>> If the issue is what you posted in SO, I think the stack trace explains
>>> it already. You want to avoid this recursive definition, which in general
>>> can't work.
>>> I think it's simply explicitly disallowed in all cases now, but, you
>>> should not be depending on this anyway - why can't this just be avoided?
>>>
>>> On Mon, Dec 13, 2021 at 10:06 AM Daniel de Oliveira Mantovani <
>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>
>>>> Sean,
>>>>
>>>> https://github.com/music-of-the-ainur/almaren-framework/tree/spark-3.2
>>>>
>>>> Just executing "sbt test" will reproduce the error. The same code works
>>>> for Spark 2.3.x, 2.4.x and 3.1.x; why doesn't it work for Spark 3.2 ?
>>>>
>>>> Thank you so much
>>>>
>>>>
>>>>
>>>> On Mon, Dec 13, 2021 at 12:59 PM Sean Owen  wrote:
>>>>
>>>>> ... but the error is not "because that already exists". See your stack
>>>>> trace. It's because the definition is recursive. You define temp view
>>>>> test1, create a second DF from it, and then redefine test1 as that result.
>>>>> test1 depends on test1.
>>>>>
>>>>> On Mon, Dec 13, 2021 at 9:58 AM Daniel de Oliveira Mantovani <
>>>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>>>
>>>>>> Sean,
>>>>>>
>>>>>> The method name "createOrReplaceTempView" is very clear; it doesn't make
>>>>>> any sense to throw an exception because this view already exists. Spark
>>>>>> 3.2.x is breaking backward compatibility for no reason.
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 13, 2021 at 12:53 PM Sean Owen  wrote:
>>>>>>
>>>>>>> The error looks 'valid' - you define a temp view in terms of its own
>>>>>>> previous version, which doesn't quite make sense - somewhere the new
>>>>>>> definition depends on the old definition. I think it just correctly
>>>>>>> surfaces as an error now.
>>>>>>>
>>>>>>> On Mon, Dec 13, 2021 at 9:41 AM Daniel de Oliveira Mantovani <
>>>>>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello team,
>>>>>>>>
>>>>>>>> I've found this issue while I was porting my project from Apache
>>>>>>>> Spark 3.1.x to 3.2.x.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://stackoverflow.com/questions/69937415/spark-3-2-0-the-different-dataframe-createorreplacetempview-the-same-name-tempvi
>>>>>>>>
>>>>>>>> Do we have a bug for that in apache-spark, or do I need to create one ?
>>>>

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
I didn't post the SO issue; I just found the same exception I'm facing
with Spark 3.2. The Almaren Framework has a concept of creating temporary
views with the name "__TABLE__".

For example, if you want to use the SQL dialect on a DataFrame to join a
table, aggregate, or apply a function, instead of creating a named
temporary table you just use the "__TABLE__" alias. You don't really care
about the name of the table. You may use this "__TABLE__" approach in
different parts of your code.

Why can't I create or replace temporary views on different DataFrames with
the same name as before ?



On Mon, Dec 13, 2021 at 1:27 PM Sean Owen  wrote:

> If the issue is what you posted in SO, I think the stack trace explains it
> already. You want to avoid this recursive definition, which in general
> can't work.
> I think it's simply explicitly disallowed in all cases now, but, you
> should not be depending on this anyway - why can't this just be avoided?
>
> On Mon, Dec 13, 2021 at 10:06 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Sean,
>>
>> https://github.com/music-of-the-ainur/almaren-framework/tree/spark-3.2
>>
>> Just executing "sbt test" will reproduce the error. The same code works
>> for Spark 2.3.x, 2.4.x and 3.1.x; why doesn't it work for Spark 3.2 ?
>>
>> Thank you so much
>>
>>
>>
>> On Mon, Dec 13, 2021 at 12:59 PM Sean Owen  wrote:
>>
>>> ... but the error is not "because that already exists". See your stack
>>> trace. It's because the definition is recursive. You define temp view
>>> test1, create a second DF from it, and then redefine test1 as that result.
>>> test1 depends on test1.
>>>
>>> On Mon, Dec 13, 2021 at 9:58 AM Daniel de Oliveira Mantovani <
>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>
>>>> Sean,
>>>>
>>>> The method name "createOrReplaceTempView" is very clear; it doesn't make
>>>> any sense to throw an exception because this view already exists. Spark
>>>> 3.2.x is breaking backward compatibility for no reason.
>>>>
>>>>
>>>> On Mon, Dec 13, 2021 at 12:53 PM Sean Owen  wrote:
>>>>
>>>>> The error looks 'valid' - you define a temp view in terms of its own
>>>>> previous version, which doesn't quite make sense - somewhere the new
>>>>> definition depends on the old definition. I think it just correctly
>>>>> surfaces as an error now.
>>>>>
>>>>> On Mon, Dec 13, 2021 at 9:41 AM Daniel de Oliveira Mantovani <
>>>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>>>
>>>>>> Hello team,
>>>>>>
>>>>>> I've found this issue while I was porting my project from Apache
>>>>>> Spark 3.1.x to 3.2.x.
>>>>>>
>>>>>>
>>>>>> https://stackoverflow.com/questions/69937415/spark-3-2-0-the-different-dataframe-createorreplacetempview-the-same-name-tempvi
>>>>>>
>>>>>> Do we have a bug for that in apache-spark, or do I need to create one ?
>>>>>>
>>>>>> Thank you so much
>>>>>>
>>>>>> [info] com.github.music.of.the.ainur.almaren.Test *** ABORTED ***
>>>>>> [info]   org.apache.spark.sql.AnalysisException: Recursive view
>>>>>> `__TABLE__` detected (cycle: `__TABLE__` -> `__TABLE__`)
>>>>>> [info]   at
>>>>>> org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
>>>>>> [info]   at
>>>>>> org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
>>>>>> [info]   at
>>>>>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
>>>>>> [info]   at
>>>>>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
>>>>>> [info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
>>>>>> [info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>>>>>> [info]   at
>>>>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>>>>>> [info]   at
>>>>>> scala.collection.IterableLike.foreach(IterableLike.scala:74)
>>>>>> [info]   at
>>>>>> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>>>>>> [info]   at
>>>>>> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>>>>>>
>>>>>> --
>>>>>>
>>>>>> --
>>>>>> Daniel Mantovani
>>>>>>
>>>>>>
>>>>
>>>> --
>>>>
>>>> --
>>>> Daniel Mantovani
>>>>
>>>>
>>
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
Sean,

https://github.com/music-of-the-ainur/almaren-framework/tree/spark-3.2

Just executing "sbt test" will reproduce the error. The same code works for
Spark 2.3.x, 2.4.x and 3.1.x; why doesn't it work for Spark 3.2 ?

Thank you so much



On Mon, Dec 13, 2021 at 12:59 PM Sean Owen  wrote:

> ... but the error is not "because that already exists". See your stack
> trace. It's because the definition is recursive. You define temp view
> test1, create a second DF from it, and then redefine test1 as that result.
> test1 depends on test1.
>
> On Mon, Dec 13, 2021 at 9:58 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Sean,
>>
>> The method name "createOrReplaceTempView" is very clear; it doesn't make any
>> sense to throw an exception because this view already exists. Spark 3.2.x
>> is breaking backward compatibility for no reason.
>>
>>
>> On Mon, Dec 13, 2021 at 12:53 PM Sean Owen  wrote:
>>
>>> The error looks 'valid' - you define a temp view in terms of its own
>>> previous version, which doesn't quite make sense - somewhere the new
>>> definition depends on the old definition. I think it just correctly
>>> surfaces as an error now.
>>>
>>> On Mon, Dec 13, 2021 at 9:41 AM Daniel de Oliveira Mantovani <
>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>
>>>> Hello team,
>>>>
>>>> I've found this issue while I was porting my project from Apache Spark
>>>> 3.1.x to 3.2.x.
>>>>
>>>>
>>>> https://stackoverflow.com/questions/69937415/spark-3-2-0-the-different-dataframe-createorreplacetempview-the-same-name-tempvi
>>>>
>>>> Do we have a bug for that in apache-spark, or do I need to create one ?
>>>>
>>>> Thank you so much
>>>>
>>>> [info] com.github.music.of.the.ainur.almaren.Test *** ABORTED ***
>>>> [info]   org.apache.spark.sql.AnalysisException: Recursive view
>>>> `__TABLE__` detected (cycle: `__TABLE__` -> `__TABLE__`)
>>>> [info]   at
>>>> org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
>>>> [info]   at
>>>> org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
>>>> [info]   at
>>>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
>>>> [info]   at
>>>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
>>>> [info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
>>>> [info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>>>> [info]   at
>>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>>>> [info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>>>> [info]   at
>>>> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>>>> [info]   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>>>>
>>>> --
>>>>
>>>> --
>>>> Daniel Mantovani
>>>>
>>>>
>>
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
Sean,

The method name "createOrReplaceTempView" is very clear; it doesn't make any
sense to throw an exception because this view already exists. Spark 3.2.x
is breaking backward compatibility for no reason.


On Mon, Dec 13, 2021 at 12:53 PM Sean Owen  wrote:

> The error looks 'valid' - you define a temp view in terms of its own
> previous version, which doesn't quite make sense - somewhere the new
> definition depends on the old definition. I think it just correctly
> surfaces as an error now.
>
> On Mon, Dec 13, 2021 at 9:41 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Hello team,
>>
>> I've found this issue while I was porting my project from Apache Spark
>> 3.1.x to 3.2.x.
>>
>>
>> https://stackoverflow.com/questions/69937415/spark-3-2-0-the-different-dataframe-createorreplacetempview-the-same-name-tempvi
>>
>> Do we have a bug for that in apache-spark, or do I need to create one ?
>>
>> Thank you so much
>>
>> [info] com.github.music.of.the.ainur.almaren.Test *** ABORTED ***
>> [info]   org.apache.spark.sql.AnalysisException: Recursive view
>> `__TABLE__` detected (cycle: `__TABLE__` -> `__TABLE__`)
>> [info]   at
>> org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
>> [info]   at
>> org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
>> [info]   at
>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
>> [info]   at
>> org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
>> [info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
>> [info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>> [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>> [info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>> [info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>> [info]   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>>
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
Hello team,

I've found this issue while I was porting my project from Apache Spark
3.1.x to 3.2.x.

https://stackoverflow.com/questions/69937415/spark-3-2-0-the-different-dataframe-createorreplacetempview-the-same-name-tempvi

Do we have a bug for that in apache-spark, or do I need to create one ?

Thank you so much

[info] com.github.music.of.the.ainur.almaren.Test *** ABORTED ***
[info]   org.apache.spark.sql.AnalysisException: Recursive view `__TABLE__`
detected (cycle: `__TABLE__` -> `__TABLE__`)
[info]   at
org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
[info]   at
org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
[info]   at
org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
[info]   at
org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
[info]   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)

-- 

--
Daniel Mantovani


Re: How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Daniel de Oliveira Mantovani
Hi Mich,

Unfortunately it doesn't support PySpark, just Scala/Java. It wouldn't be a
big deal to implement the Quenya DSL for PySpark as well; I will add it to
the roadmap.

Thank you

On Mon, Nov 22, 2021 at 1:24 PM Mich Talebzadeh 
wrote:

> Ok interesting Daniel,
>
> I did not see support for PySpark. Is this work in progress?
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 22 Nov 2021 at 16:00, Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Hi Spark Team,
>>
>> I've written a library for Apache Spark to flatten JSON/Avro/Parquet/XML
>> using a DSL (Domain Specific Language) in Apache Spark. You actually don't
>> even need to write the DSL; you can generate it as well :)
>>
>> I've written an article that teaches how to use it:
>>
>> https://medium.com/@danielmantovani/flattening-json-in-apache-spark-with-quenya-dsl-b3af6bd2442d
>>
>> Project Page:
>> https://github.com/modakanalytics/quenya-dsl
>>
>> For all data engineers who won't have to spend time flattening nested
>> data structures anymore.
>> XOXO
>>
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Daniel de Oliveira Mantovani
Hi Spark Team,

I've written a library for Apache Spark to flatten JSON/Avro/Parquet/XML
using a DSL (Domain Specific Language) in Apache Spark. You actually don't
even need to write the DSL; you can generate it as well :)

I've written an article that teaches how to use it:
https://medium.com/@danielmantovani/flattening-json-in-apache-spark-with-quenya-dsl-b3af6bd2442d

Project Page:
https://github.com/modakanalytics/quenya-dsl

For all data engineers who won't have to spend time flattening nested data
structures anymore.
XOXO

-- 

--
Daniel Mantovani


Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
From the Cloudera Documentation:
https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf

UseNativeQuery
 1: The driver does not transform the queries emitted by applications, so
the native query is used.
 0: The driver transforms the queries emitted by applications and converts
them into an equivalent form in HiveQL.


Try changing the "UseNativeQuery" parameter and see if it works :)
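
A minimal sketch of what that change looks like (host, port, and credentials are
placeholders following the snippet quoted further down this thread; whether it resolves
the ParseException still has to be confirmed against the actual HiveServer2):

df.write
  .format("jdbc")
  .option("url", "jdbc:hive2://localhost:10000/foundation;AuthMech=2;UseNativeQuery=1")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .mode("overwrite")
  .save()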

On Tue, Jul 20, 2021 at 1:26 PM Daniel de Oliveira Mantovani <
daniel.oliveira.mantov...@gmail.com> wrote:

> The insert mode is "overwrite", so it shouldn't matter whether the table
> already exists or not. The JDBC driver should match the Cloudera Hive
> version, and we can't know which CDH version he's using.
>
> On Tue, Jul 20, 2021 at 1:21 PM Mich Talebzadeh 
> wrote:
>
>> The driver is the latest and it should work.
>>
>> I have asked the thread owner to send the DDL of the table and how the
>> table is created. In this case JDBC from Spark expects the table to be
>> there.
>>
>> The error below
>>
>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>> processing query/statement. Error Code: 4, SQL state:
>> TStatus(statusCode:ERROR_STATUS,
>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>> foreign key:28:27,
>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329
>>
>> Sounds like a mismatch between the columns in the Spark DataFrame and
>> the underlying table.
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani <
>> daniel.oliveira.mantov...@gmail.com> wrote:
>>
>>> Badrinath is trying to write to Hive in a cluster where he doesn't
>>> have permission to submit Spark jobs and doesn't have Hive/Spark metadata
>>> access.
>>> The only way to communicate with this third-party Hive cluster is
>>> through the JDBC protocol.
>>>
>>> [ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]
>>>
>>> It is Spark that creates this table, because he's using "overwrite" in
>>> order to test it.
>>>
>>>  df.write
>>>   .format("jdbc")
>>>   .option("url",
>>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>>   .option("dbtable", "test.test")
>>>   .option("user", "admin")
>>>   .option("password", "admin")
>>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>> *  .mode("overwrite")*
>>>   .save
>>>
>>> This error is odd; it looks like the third-party Hive server isn't able
>>> to recognize the SQL dialect coming from the [Spark Standalone] server's
>>> JDBC driver.
>>>
>>> 1) I would try to execute the create statement manually on this server
>>> 2) If it works, try running again with the "append" option
>>>
>>> I would open a case with Cloudera and ask which driver you should use.
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
>>> wrote:
>>>
>>>> As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
>>>> saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
>>>> Java client for instance) to access the Hive tables in Spark via the Thrift
>>>> server interface.
>>>>
>>>> -- ND
>>>>
>>>> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>>>>
>>>> I have been trying to create a table in Hive from Spark itself.
>>>>
>>>> Using local mode it works; what I am trying here is, from Spark
>>>> standalone, to create the managed table in Hive (another Spark cluster,
>>>> basically CDH) using JDBC mode.

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
The insert mode is "overwrite", so it shouldn't matter whether the table
already exists or not. The JDBC driver should match the Cloudera Hive
version, and we can't know which CDH version he's using.

On Tue, Jul 20, 2021 at 1:21 PM Mich Talebzadeh 
wrote:

> The driver is the latest and it should work.
>
> I have asked the thread owner to send the DDL of the table and how the
> table is created. In this case JDBC from Spark expects the table to be
> there.
>
> The error below
>
> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
> processing query/statement. Error Code: 4, SQL state:
> TStatus(statusCode:ERROR_STATUS,
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
> compiling statement: FAILED: ParseException line 1:39 cannot recognize
> input near '"first_name"' 'TEXT' ',' in column name or primary key or
> foreign key:28:27,
> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329
>
> Sounds like a mismatch between the columns in the Spark DataFrame and the
> underlying table.
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Badrinath is trying to write to Hive in a cluster where he doesn't have
>> permission to submit Spark jobs and doesn't have Hive/Spark metadata
>> access.
>> The only way to communicate with this third-party Hive cluster is through
>> the JDBC protocol.
>>
>> [ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]
>>
>> It is Spark that creates this table, because he's using "overwrite" in
>> order to test it.
>>
>>  df.write
>>   .format("jdbc")
>>   .option("url",
>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>   .option("dbtable", "test.test")
>>   .option("user", "admin")
>>   .option("password", "admin")
>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>> *  .mode("overwrite")*
>>   .save
>>
>> This error is odd; it looks like the third-party Hive server isn't able to
>> recognize the SQL dialect coming from the [Spark Standalone] server's JDBC
>> driver.
>>
>> 1) I would try to execute the create statement manually on this server
>> 2) If it works, try running again with the "append" option
>>
>> I would open a case with Cloudera and ask which driver you should use.
>>
>> Thanks
>>
>>
>>
>> On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
>> wrote:
>>
>>> As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
>>> saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
>>> Java client for instance) to access the Hive tables in Spark via the Thrift
>>> server interface.
>>>
>>> -- ND
>>>
>>> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>>>
>>> I have been trying to create a table in Hive from Spark itself.
>>>
>>> Using local mode it works; what I am trying here is, from Spark
>>> standalone, to create the managed table in Hive (another Spark cluster,
>>> basically CDH) using JDBC mode.
>>>
>>> When I try that, below is the error I am facing.
>>>
>>> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Have you created that table in Hive, or are you trying to create it from
>>>> Spark itself?
>>>>
>>>> Your Hive is local. In this case you don't need a JDBC connection. Have
>>>> you tried:
>>>>
>>>> df2.write.mode("overwrite").saveAsTable("mydb.mytable")
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at you

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
Badrinath is trying to write to Hive in a cluster where he doesn't have
permission to submit Spark jobs and doesn't have Hive/Spark metadata
access.
The only way to communicate with this third-party Hive cluster is through
the JDBC protocol.

[ Cloudera Data Hub - Hive Server] <-> [Spark Standalone]

It is Spark that creates this table, because he's using "overwrite" in
order to test it.

 df.write
  .format("jdbc")
  .option("url",
"jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
  .option("dbtable", "test.test")
  .option("user", "admin")
  .option("password", "admin")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
*  .mode("overwrite")*
  .save

This error is odd; it looks like the third-party Hive server isn't able to
recognize the SQL dialect coming from the [Spark Standalone] server's JDBC
driver.

1) I would try to execute the create statement manually on this server
2) If it works, try running again with the "append" option

I would open a case with Cloudera and ask which driver you should use.

Thanks



On Mon, Jul 19, 2021 at 10:33 AM Artemis User 
wrote:

> As Mich mentioned, no need to use jdbc API, using the DataFrameWriter's
> saveAsTable method is the way to go.   JDBC Driver is for a JDBC client (a
> Java client for instance) to access the Hive tables in Spark via the Thrift
> server interface.
>
> -- ND
>
> On 7/19/21 2:42 AM, Badrinath Patchikolla wrote:
>
> I have been trying to create a table in Hive from Spark itself.
>
> Using local mode it works; what I am trying here is, from Spark
> standalone, to create the managed table in Hive (another Spark cluster,
> basically CDH) using JDBC mode.
>
> When I try that, below is the error I am facing.
>
> On Thu, 15 Jul, 2021, 9:55 pm Mich Talebzadeh, 
> wrote:
>
>> Have you created that table in Hive, or are you trying to create it from
>> Spark itself?
>>
>> Your Hive is local. In this case you don't need a JDBC connection. Have
>> you tried:
>>
>> df2.write.mode("overwrite").saveAsTable("mydb.mytable")
>>
>> HTH
>>
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 15 Jul 2021 at 12:51, Badrinath Patchikolla <
>> pbadrinath1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Trying to write data from Spark to Hive in JDBC mode; below is the
>>> sample code:
>>>
>>> spark standalone 2.4.7 version
>>>
>>> 21/07/15 08:04:07 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> Setting default log level to "WARN".
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>> Spark context Web UI available at http://localhost:4040
>>> Spark context available as 'sc' (master = spark://localhost:7077, app id
>>> = app-20210715080414-0817).
>>> Spark session available as 'spark'.
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>>>       /_/
>>>
>>> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>>
>>> scala> :paste
>>> // Entering paste mode (ctrl-D to finish)
>>>
>>> val df = Seq(
>>> ("John", "Smith", "London"),
>>> ("David", "Jones", "India"),
>>> ("Michael", "Johnson", "Indonesia"),
>>> ("Chris", "Lee", "Brazil"),
>>> ("Mike", "Brown", "Russia")
>>>   ).toDF("first_name", "last_name", "country")
>>>
>>>
>>>  df.write
>>>   .format("jdbc")
>>>   .option("url",
>>> "jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
>>>   .option("dbtable", "test.test")
>>>   .option("user", "admin")
>>>   .option("password", "admin")
>>>   .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
>>>   .mode("overwrite")
>>>   .save
>>>
>>>
>>> // Exiting paste mode, now interpreting.
>>>
>>> java.sql.SQLException: [Cloudera][HiveJDBCDriver](500051) ERROR
>>> processing query/statement. Error Code: 4, SQL state:
>>> TStatus(statusCode:ERROR_STATUS,
>>> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while
>>> compiling statement: FAILED: ParseException line 1:39 cannot recognize
>>> input near '"first_name"' 'TEXT' ',' in column name or primary key or
>>> foreign key:28:27,
>>> org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329,
>>> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:207,
>>> org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290,

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-06-14 Thread Daniel de Oliveira Mantovani
Did you include the Apache Spark dependencies in your build? If you did, you
should remove them. If you are using sbt, all Spark dependencies should be
marked as "provided".

On Wed, Jun 2, 2021 at 10:11 AM Kanchan Kauthale <
kanchankauthal...@gmail.com> wrote:

> Hello Sean,
>
> Please find below the stack trace-
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.spark_project.jetty.servlet.ServletContextHandler
>  at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:143)
>  at
> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:130)
>  at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:89)
>  at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:71)
>  at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:71)
>  at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:71)
>  at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:62)
>  at org.apache.spark.ui.SparkUI.(SparkUI.scala:80)
>  at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:178)
>  at org.apache.spark.SparkContext.(SparkContext.scala:444)
>  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2526)
>  at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
>
> Any help regarding this, much appreciated.
> Thank you
> Kanchan Kauthale
>
> On Tue, Jun 1, 2021 at 12:49 PM Kanchan Kauthale <
> kanchankauthal...@gmail.com> wrote:
>
>> Thank you Sean for having a look at my query. I am checking the environment
>> at my end.
>>
>> The code belongs to my organization's environment and I don't have it in my
>> local environment, so I am waiting for confirmation from the compliance team
>> to get the error and stack trace out of the organization environment.
>>
>> I will get back to you.
>>
>> Thank you again
>> Kanchan Kauthale
>>
>> On Thu, May 27, 2021, 18:52 Sean Owen  wrote:
>>
>>> Despite the name, the error doesn't mean the class isn't found, but that
>>> it could not be initialized. What's the rest of the error?
>>> I don't believe any testing has ever encountered this error, so it's
>>> likely something to do with your environment, but I don't know what.
>>>
>>> On Thu, May 27, 2021 at 7:32 AM Kanchan Kauthale <
>>> kanchankauthal...@gmail.com> wrote:
>>>
 Hello,

 We have an existing project which works fine with Spark 2.4.7. We want
 to upgrade the spark version to 2.4.8. Scala version we are using is- 2.11
 After building with upgraded pom, we are getting error below for test 
 cases-

 java.lang.NoClassDefFoundError: Could not initialize class
 org.spark_project.jetty.servlet.ServletContextHandler
 at
 org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:143)

 When I checked the Maven dependencies, I could find the
 ServletContextHandler class under spark-core_2.11-2.4.8.jar in the given
 package hierarchy. I have compiled the code, dependencies have been
 resolved, and the classpath is updated.

 Any hint regarding this error would help.


 Thank you

 Kanchan

>>>

-- 

--
Daniel Mantovani


Re: How to submit a job via REST API?

2020-12-18 Thread Daniel de Oliveira Mantovani
In my opinion this should be part of the official documentation. Amazing
work Zhou Yang.

On Wed, Nov 25, 2020 at 5:45 AM Zhou Yang  wrote:

> Hi all,
>
> I found the solution through the source code. Appending the --conf k-v into
> `sparkProperties` works.
> For example:
>
> ./spark-submit \
> --conf foo=bar \
> xxx
>
> equals to
>
> {
>   "xxx" : "yyy",
>   "sparkProperties" : {
>     "foo" : "bar"
>   }
> }
>
> Thanks for your reply.
>
> 2020年11月25日 下午3:55,vaquar khan  写道:
>
> Hi Yang,
>
> Please find following link
>
>
> https://stackoverflow.com/questions/63677736/spark-application-as-a-rest-service/63678337#63678337
>
> Regards,
> Vaquar khan
>
> On Wed, Nov 25, 2020 at 12:40 AM Sonal Goyal 
> wrote:
>
>> You should be able to supply the --conf and its values as part of appArgs
>> argument
>>
>> Cheers,
>> Sonal
>> Nube Technologies 
>> Join me at
>> Data Con LA Oct 23 | Big Data Conference Europe. Nov 24 | GIDS AI/ML Dec
>> 3
>>
>>
>>
>>
>> On Tue, Nov 24, 2020 at 11:31 AM Dennis Suhari <
>> d.suh...@icloud.com.invalid> wrote:
>>
>>> Hi Yang,
>>>
>>> I am using Livy Server for submitting jobs.
>>>
>>> Br,
>>>
>>> Dennis
>>>
>>>
>>>
>>> Von meinem iPhone gesendet
>>>
>>> On 24.11.2020 at 03:34, Zhou Yang wrote:
>>>
>>> 
>>> Dear experts,
>>>
>>> I found a convenient way to submit a job via the REST API at
>>> https://gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a#file-submit_job-sh
>>> .
>>> But I did not know whether I can append the `--conf` parameter like what I
>>> did in spark-submit. Can someone help me with this issue?
>>>
>>> *Regards, Yang*
>>>
>>>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
>
>
>

-- 

--
Daniel Mantovani


Is possible to give options when reading semistructured files using SQL Syntax?

2020-07-28 Thread Daniel de Oliveira Mantovani
Is it possible to give options when reading semi-structured files using SQL
syntax, like in the example below?

SELECT * FROM csv.`file.csv`

For example, if I want to have header=true. Is it possible ?
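
As far as I know the path-based csv.`...` syntax does not accept reader options, but a
pure-SQL alternative is a temporary view defined with USING csv OPTIONS. A minimal sketch
(the file path and option values are placeholders):

spark.sql("""
  CREATE TEMPORARY VIEW file_csv
  USING csv
  OPTIONS (path 'file.csv', header 'true', inferSchema 'true')
""")

spark.sql("SELECT * FROM file_csv").show()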

Thanks

-- 

--
Daniel Mantovani


Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-17 Thread Daniel de Oliveira Mantovani
Sorry for the misunderstanding; I found out today that my colleagues
didn't actually get Spark to work with Kerberos authentication for Hive
JDBC.

Spark can't pass Kerberos parameters to the executors.

Sorry again for the misunderstanding.

On Thu, Jul 9, 2020 at 9:52 PM Jeff Evans 
wrote:

> There are various sample JDBC URLs documented here, depending on the
> driver vendor, Kerberos (or not), and SSL (or not).  Often times,
> unsurprisingly, SSL is used in conjunction with Kerberos.  Even if you
> don't use StreamSets software at all, you might find these useful.
>
>
> https://ask.streamsets.com/question/7/how-do-you-configure-a-hive-impala-jdbc-driver-for-data-collector/?answer=8#post-id-8
>
> On Thu, Jul 9, 2020 at 11:28 AM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> One of my colleagues found this solution:
>>
>>
>> https://github.com/morfious902002/impala-spark-jdbc-kerberos/blob/master/src/main/java/ImpalaSparkJDBC.java
>>
>> If you need to connect to Hive or Impala using JDBC with Kerberos
>> authentication from Apache Spark, you can use it and it will work.
>>
>> You can download the driver from Cloudera here:
>> https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-1.html
>>
>>
>>
>> On Tue, Jul 7, 2020 at 12:03 AM Daniel de Oliveira Mantovani <
>> daniel.oliveira.mantov...@gmail.com> wrote:
>>
>>> Hello Gabor,
>>>
>>> I meant, third-party connector* not "connection".
>>>
>>> Thank you so much!
>>>
>>> On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi 
>>> wrote:
>>>
>>>> Hi Daniel,
>>>>
>>>> I'm just working on the developer API where any custom JDBC connection
>>>> provider(including Hive) can be added.
>>>> Not sure what you mean by third-party connection but AFAIK there is no
>>>> workaround at the moment.
>>>>
>>>> BR,
>>>> G
>>>>
>>>>
>>>> On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <
>>>> daniel.oliveira.mantov...@gmail.com> wrote:
>>>>
>>>>> Hello List,
>>>>>
>>>>> Is it possible to access Hive 2 through JDBC with Kerberos
>>>>> authentication from Apache Spark JDBC interface ? If it's possible do you
>>>>> have an example ?
>>>>>
>>>>> I found these tickets on JIRA:
>>>>> https://issues.apache.org/jira/browse/SPARK-12312
>>>>> https://issues.apache.org/jira/browse/SPARK-31815
>>>>>
>>>>> Do you know if there's a workaround for this ? Maybe using a
>>>>> third-party connection ?
>>>>>
>>>>> Thank you so much
>>>>> --
>>>>>
>>>>> --
>>>>> Daniel Mantovani
>>>>>
>>>>>
>>>
>>> --
>>>
>>> --
>>> Daniel Mantovani
>>>
>>>
>>
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-09 Thread Daniel de Oliveira Mantovani
One of my colleagues found this solution:

https://github.com/morfious902002/impala-spark-jdbc-kerberos/blob/master/src/main/java/ImpalaSparkJDBC.java

If you need to connect to Hive or Impala using JDBC with Kerberos
authentication from Apache Spark, you can use it and it will work.

You can download the driver from Cloudera here:
https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-1.html
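
A minimal sketch of the JDBC read itself (host, realm, and table names are placeholders;
AuthMech=1 selects Kerberos in the Cloudera driver). Note that a valid Kerberos login is
assumed in every JVM that opens a connection, and, as the 2020-07-17 follow-up in this
thread points out, Spark does not pass Kerberos parameters on to the executors, so treat
this as the local-mode or single-JVM case rather than a general solution.

val jdbcUrl = "jdbc:hive2://hive-host.example.com:10000/default;" +
  "AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hive-host.example.com;KrbServiceName=hive"

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "my_db.my_table")
  .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
  .load()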



On Tue, Jul 7, 2020 at 12:03 AM Daniel de Oliveira Mantovani <
daniel.oliveira.mantov...@gmail.com> wrote:

> Hello Gabor,
>
> I meant, third-party connector* not "connection".
>
> Thank you so much!
>
> On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi 
> wrote:
>
>> Hi Daniel,
>>
>> I'm just working on the developer API where any custom JDBC connection
>> provider(including Hive) can be added.
>> Not sure what you mean by third-party connection but AFAIK there is no
>> workaround at the moment.
>>
>> BR,
>> G
>>
>>
>> On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <
>> daniel.oliveira.mantov...@gmail.com> wrote:
>>
>>> Hello List,
>>>
>>> Is it possible to access Hive 2 through JDBC with Kerberos
>>> authentication from Apache Spark JDBC interface ? If it's possible do you
>>> have an example ?
>>>
>>> I found these tickets on JIRA:
>>> https://issues.apache.org/jira/browse/SPARK-12312
>>> https://issues.apache.org/jira/browse/SPARK-31815
>>>
>>> Do you know if there's a workaround for this ? Maybe using a third-party
>>> connection ?
>>>
>>> Thank you so much
>>> --
>>>
>>> --
>>> Daniel Mantovani
>>>
>>>
>
> --
>
> --
> Daniel Mantovani
>
>

-- 

--
Daniel Mantovani


Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Daniel de Oliveira Mantovani
Hi Teja,

To access Hive 3 using Apache Spark 2.x.x you need to use this connector
from Cloudera
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
.
It has many limitations: you can only write to Hive managed tables in
ORC format. But you can mitigate this problem by writing to Hive unmanaged
tables, so Parquet will work.
The performance is also not the same.
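
A minimal sketch of the unmanaged-table mitigation (database, table name, and path are
placeholders, and df stands for the DataFrame being persisted): giving saveAsTable an
explicit path makes the table external/unmanaged and lets it be stored as Parquet.
Whether this fits a particular HDP 3 catalog setup still has to be verified.

df.write
  .mode("overwrite")
  .format("parquet")
  .option("path", "hdfs:///warehouse/external/my_table")  // explicit path => unmanaged table
  .saveAsTable("my_db.my_table")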

Good luck


On Mon, Jul 6, 2020 at 3:16 PM Sean Owen  wrote:

> 2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work
> connecting to Hadoop 3 / Hive 3; it's possible in a few cases.
> It's also possible some vendor distributions support this combination.
>
> On Mon, Jul 6, 2020 at 7:51 AM Teja  wrote:
> >
> > We use spark 2.4.0 to connect to Hadoop 2.7 cluster and query from Hive
> > Metastore version 2.3. But the Cluster managing team has decided to
> upgrade
> > to Hadoop 3.x and Hive 3.x. We could not migrate to spark 3 yet, which is
> > compatible with Hadoop 3 and Hive 3, as we could not test if anything
> > breaks.
> >
> > *Is there any possible way to stick to spark 2.4.x version and still be
> able
> > to use Hadoop 3 and Hive 3?
> > *
> >
> > I got to know backporting is one option but I am not sure how. It would
> be
> > great if you could point me in that direction.
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 

--
Daniel Mantovani


Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello Gabor,

I meant third-party connector*, not "connection".

Thank you so much!

On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi 
wrote:

> Hi Daniel,
>
> I'm just working on the developer API where any custom JDBC connection
> provider(including Hive) can be added.
> Not sure what you mean by third-party connection but AFAIK there is no
> workaround at the moment.
>
> BR,
> G
>
>
> On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Hello List,
>>
>> Is it possible to access Hive 2 through JDBC with Kerberos authentication
>> from Apache Spark JDBC interface ? If it's possible do you have an example ?
>>
>> I found these tickets on JIRA:
>> https://issues.apache.org/jira/browse/SPARK-12312
>> https://issues.apache.org/jira/browse/SPARK-31815
>>
>> Do you know if there's a workaround for this ? Maybe using a third-party
>> connection ?
>>
>> Thank you so much
>> --
>>
>> --
>> Daniel Mantovani
>>
>>

-- 

--
Daniel Mantovani


How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello List,

Is it possible to access Hive 2 through JDBC with Kerberos authentication
from Apache Spark JDBC interface ? If it's possible do you have an example ?

I found these tickets on JIRA:
https://issues.apache.org/jira/browse/SPARK-12312
https://issues.apache.org/jira/browse/SPARK-31815

Do you know if there's a workaround for this ? Maybe using a third-party
connection ?

Thank you so much
-- 

--
Daniel Mantovani


Re: how to use cluster sparkSession like localSession

2018-11-01 Thread Daniel de Oliveira Mantovani
Please read about Spark Streaming or Spark Structured Streaming. Your web
application can easily communicate with it through some API and you won't
have the overhead of starting a new Spark job, which is pretty heavy.
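
A minimal sketch of that pattern (topic names, broker address, and the per-request logic
are placeholders, not from this thread): one long-running Structured Streaming job serves
requests that the web application publishes to Kafka, instead of starting a new Spark job
per request.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("request-server").getOrCreate()
import spark.implicits._

val requests = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "web-requests")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")  // incoming request payload as a string

// Stand-in for the per-request logic (e.g. the filter the web application asked for).
val responses = requests.filter($"value".contains("iris"))

responses.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "web-responses")
  .option("checkpointLocation", "/tmp/request-server-checkpoint")
  .start()
  .awaitTermination()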

On Thu, Nov 1, 2018 at 23:01 崔苗 (Data & AI Product Development Department) <0049003...@znv.com> wrote:

>
> Hi,
> we want to execute spark code with out submit application.jar,like this
> code:
>
> public static void main(String args[]) throws Exception{
> SparkSession spark = SparkSession
> .builder()
> .master("local[*]")
> .appName("spark test")
> .getOrCreate();
>
> Dataset<Row> testData =
> spark.read().csv(".\\src\\main\\java\\Resources\\no_schema_iris.scv");
> testData.printSchema();
> testData.show();
> }
>
> The above code works well in IDEA and does not need a jar file to be
> generated and submitted, but if we replace master("local[*]") with
> master("yarn") it doesn't work. So is there a way to use a cluster
> SparkSession like a local SparkSession? We need to dynamically execute
> Spark code in a web server according to the different requests (for
> example, a filter request will call dataset.filter()), so there is no
> application jar to submit.
>
> 0049003208
> 0049003...@znv.com
>
> - To
> unsubscribe e-mail: user-unsubscr...@spark.apache.org

-- 

--
Daniel de Oliveira Mantovani
Perl Evangelist/Data Hacker
+1 786 459 1341


Re: Getting Message From Structured Streaming Format Kafka

2017-12-01 Thread Daniel de Oliveira Mantovani
Hello Burak,

Sorry for the delayed answer, you were right.

1) I changed the sql-kafka connector version and that fixed it.
2) The purpose was just a test, and I was also using normal streaming for
other things.

I was wondering how you knew it was the sql-kafka connector version from
reading the logs. I couldn't find anything useful there.
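
A minimal build.sbt sketch of point 1 (the version number is illustrative): the
spark-sql-kafka-0-10 connector has to sit on exactly the same version as spark-sql itself.

val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)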

Thank you very much!

On Thu, Nov 2, 2017 at 12:04 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi Daniel,
>
> Several things:
>  1) Your error seems to suggest you're using a different version of Spark
> and a different version of the sql-kafka connector. Could you make sure
> they are on the same Spark version?
>  2) With Structured Streaming, you may remove everything related to a
> StreamingContext.
>
> val sparkConf = new SparkConf().setMaster(master).setAppName(appName)
> sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
>
> val ssc = new StreamingContext(sparkConf, batchDuration)
> ssc.checkpoint(checkpointDir)
> ssc.remember(Minutes(1))
>
> These lines are not doing anything for Structured Streaming.
>
>
> Best,
> Burak
>
> On Thu, Nov 2, 2017 at 11:36 AM, Daniel de Oliveira Mantovani <
> daniel.oliveira.mantov...@gmail.com> wrote:
>
>> Hello, I'm trying to run the following code,
>>
>> var newContextCreated = false // Flag to detect whether new context was created or not
>> val kafkaBrokers = "localhost:9092" // comma separated list of broker:host
>>
>> private val batchDuration: Duration = Seconds(3)
>> private val master: String = "local[2]"
>> private val appName: String = this.getClass().getSimpleName()
>> private val checkpointDir: String = "/tmp/spark-streaming-amqp-tests"
>>
>> // Create a Spark configuration
>>
>> val sparkConf = new SparkConf().setMaster(master).setAppName(appName)
>> sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
>>
>> val ssc = new StreamingContext(sparkConf, batchDuration)
>> ssc.checkpoint(checkpointDir)
>> ssc.remember(Minutes(1)) // To make sure data is not deleted by the time we query it interactively
>>
>> val spark = SparkSession
>>   .builder
>>   .config(sparkConf)
>>   .getOrCreate()
>>
>> val lines = spark
>>   .readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "localhost:9092")
>>   .option("subscribe", "evil_queue")
>>   .load()
>>
>> lines.printSchema()
>>
>> import spark.implicits._
>> val noAggDF = lines.select("key")
>>
>> noAggDF
>>   .writeStream
>>   .format("console")
>>   .start()
>>
>>
>> But I'm having the error:
>>
>> http://paste.scsys.co.uk/565658
>>
>>
>> How do I get my messages using the "kafka" format from Structured Streaming ?
>>
>>
>> Thank you
>>
>>
>> --
>>
>> --
>> Daniel de Oliveira Mantovani
>> Perl Evangelist/Data Hacker
>> +1 786 459 1341 <(786)%20459-1341>
>>
>
>


-- 

--
Daniel de Oliveira Mantovani
Perl Evangelist/Data Hacker
+1 786 459 1341


Getting Message From Structured Streaming Format Kafka

2017-11-02 Thread Daniel de Oliveira Mantovani
Hello, I'm trying to run the following code,

var newContextCreated = false // Flag to detect whether new context was created or not
val kafkaBrokers = "localhost:9092" // comma separated list of broker:host

private val batchDuration: Duration = Seconds(3)
private val master: String = "local[2]"
private val appName: String = this.getClass().getSimpleName()
private val checkpointDir: String = "/tmp/spark-streaming-amqp-tests"

// Create a Spark configuration

val sparkConf = new SparkConf().setMaster(master).setAppName(appName)
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(sparkConf, batchDuration)
ssc.checkpoint(checkpointDir)
ssc.remember(Minutes(1)) // To make sure data is not deleted by the time we query it interactively

val spark = SparkSession
  .builder
  .config(sparkConf)
  .getOrCreate()

val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "evil_queue")
  .load()

lines.printSchema()

import spark.implicits._
val noAggDF = lines.select("key")

noAggDF
  .writeStream
  .format("console")
  .start()


But I'm having the error:

http://paste.scsys.co.uk/565658


How do I get my messages using the "kafka" format from Structured Streaming ?


Thank you


-- 

--
Daniel de Oliveira Mantovani
Perl Evangelist/Data Hacker
+1 786 459 1341


Getting RabbitMQ Message Delivery Tag (Stratio/spark-rabbitmq)

2017-10-30 Thread Daniel de Oliveira Mantovani
Hello,

I'm using Stratio/spark-rabbitmq to read messages from RabbitMQ and save them
to Kafka, and I just want to "commit" the RabbitMQ message once it's safe on
Kafka's broker. For efficiency purposes I'm using the Kafka buffer and a
callback object, which should have the RabbitMQ message delivery tag to
acknowledge it properly on RabbitMQ.

I couldn't find an interface on Stratio/spark-rabbitmq to get the delivery
tag:

val stream = createRabbitStream(ssc);


stream.foreachRDD( rdd => {
  rdd.foreachPartition( partition => {
val kafkaOpTopic = "evil_queue"
val broker = getKafka()
partition.foreach( record => {
  val data = record.toString
  val message = new ProducerRecord[String, String](kafkaOpTopic, null, data)
  broker.send(message);
} )
broker.close()
  })
})


Basically, the variable "record" contains the message itself and not an
object with the RabbitMQ message structure, which would include the
delivery tag. I really need the delivery tag to write an efficient and safe
reader.
Does anyone know how to get the delivery tag ? Or should I use another
library to read from RabbitMQ ?
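
A minimal sketch of an alternative (queue name, connection settings, and the Kafka send
are placeholders; this uses the plain RabbitMQ Java client, not the Stratio receiver):
the client's Envelope exposes the delivery tag, so the message can be acknowledged only
after the Kafka broker has confirmed it.

import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

val factory = new ConnectionFactory()
factory.setHost("localhost")
val connection = factory.newConnection()
val channel = connection.createChannel()

// autoAck = false so we control acknowledgement ourselves.
channel.basicConsume("evil_queue", false, new DefaultConsumer(channel) {
  override def handleDelivery(consumerTag: String, envelope: Envelope,
                              properties: AMQP.BasicProperties, body: Array[Byte]): Unit = {
    val deliveryTag = envelope.getDeliveryTag  // the tag this thread is asking for
    val payload = new String(body, "UTF-8")
    // ... send `payload` to Kafka here and wait for the broker's confirmation ...
    channel.basicAck(deliveryTag, false)       // ack this single message
  }
})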

Thank you



-- 

--
Daniel de Oliveira Mantovani
Perl Evangelist/Data Hacker
+1 786 459 1341