Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That does not appear to be the same input you used in your example. What are
the contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati 
wrote:

> Hi @Sean Owen 
> Probably the data is incorrect, and the source needs to fix it.
> But using python's csv parser returns the correct results.
>
> import csv
>
> with open("/tmp/test.csv") as c_file:
>     csv_reader = csv.reader(c_file, delimiter=",")
>     for row in csv_reader:
>         print(row)
>
> ['a', 'b', 'c']
> ['1', '', ',see what "I did",\ni am still writing']
> ['2', '', 'abc']
>
> Also, I don't understand why the outputs of df.show() and
> df.select("c").show() differ.
>
> Mvg/Regards
> Saurabh Gulati
> Data Platform
> --
> *From:* Sean Owen 
> *Sent:* 04 January 2023 14:25
> *To:* Saurabh Gulati 
> *Cc:* Mich Talebzadeh ; User <
> user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used
> within the data
>
> That input is just invalid as CSV for any parser. You end a quoted column
> without following it with a column separator. What would the intended
> parsing be and how would it work?
>
> On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
> wrote:
>
>
> @Sean Owen  Also see the example below with quotes
> feedback:
>
> "a","b","c"
> "1","",",see what ""I did"","
> "2","","abc"
>
>


Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Shay Elbaz
If you have found a parser that works, simply read the data as text files,
apply the parser manually, and convert to a DataFrame (if needed at all).
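
A minimal sketch of that approach (assuming the file is small enough to parse on the driver and that the header row supplies the column names; the path below is only illustrative):

import csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse with Python's csv module on the driver, then hand the already-split
# rows to Spark; the first row becomes the column names.
with open("/tmp/test.csv", newline="") as c_file:
    rows = list(csv.reader(c_file, delimiter=","))

header, data = rows[0], rows[1:]
df = spark.createDataFrame(data, schema=header)
df.show(truncate=False)

For files too large for the driver, the same csv.reader call can be applied per partition (e.g. via mapPartitions over the raw text), though multi-line records then need extra care.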

From: Saurabh Gulati 
Sent: Wednesday, January 4, 2023 3:45 PM
To: Sean Owen 
Cc: Mich Talebzadeh ; User 
Subject: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within 
the data


Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
Also, I don't understand why the outputs of df.show() and
df.select("c").show() differ.

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen 
Sent: 04 January 2023 14:25
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted column
without following it with a column separator. What would the intended parsing
be and how would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

@Sean Owen Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
Also, I don't understand why the outputs of df.show() and
df.select("c").show() differ.

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen 
Sent: 04 January 2023 14:25
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted column
without following it with a column separator. What would the intended parsing
be and how would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

@Sean Owen Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-04 Thread Stelios Philippou
Vajiha,

I believe you might be confusing things?
Permanent View in PSQL is a standard view.

Temp view or Global View is the Spark View that is internal for Spark.

Can we get a snippet of the code, please?
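
For reference, a minimal sketch of that distinction (the table and view names here are only illustrative, not taken from your code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# Session-scoped: exists only inside this SparkSession and cannot back a permanent view.
df.createOrReplaceTempView("TEMP_VIEW")

# Persist the data as a table first; a permanent view over a table is allowed.
df.write.mode("overwrite").saveAsTable("default.temp1_tbl")
spark.sql("CREATE OR REPLACE VIEW default.TEMP1 AS SELECT * FROM default.temp1_tbl")

Note that a view created this way lives in the Spark catalog, not in Postgresql; a view inside Postgresql itself would have to be created on the Postgres side (e.g. over a JDBC connection) after writing the data there.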


On Wed, 4 Jan 2023 at 15:10, Vajiha Begum S A 
wrote:

>
> I have tried to create a permanent view in Postgresql DB through Pyspark
> code, but I have received the below error message. Kindly help me to create
> a permanent view table in the database. How shall I create a permanent view
> using Pyspark code? Please do reply.
>
> *Error Message::*
> *Exception has occurred: AnalysisException*
> Not allowed to create a permanent view `default`.`TEMP1` by referencing a
> temporary view TEMP_VIEW. Please create a temp view instead by CREATE TEMP
> VIEW
>
>
> Regards,
> Vajiha
> Research Analyst
> MW Solutions
>


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That input is just invalid as CSV for any parser. You end a quoted column
without following it with a column separator. What would the intended parsing
be and how would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
wrote:

>
> @Sean Owen  Also see the example below with quotes
> feedback:
>
> "a","b","c"
> "1","",",see what ""I did"","
> "2","","abc"
>
>


Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-04 Thread Vajiha Begum S A
I have tried to create a permanent view in Postgresql DB through Pyspark
code, but I have received the below error message. Kindly help me to create
a permanent view table in the database. How shall I create a permanent view
using Pyspark code? Please do reply.

*Error Message::*
*Exception has occurred: AnalysisException*
Not allowed to create a permanent view `default`.`TEMP1` by referencing a
temporary view TEMP_VIEW. Please create a temp view instead by CREATE TEMP
VIEW


Regards,
Vajiha
Research Analyst
MW Solutions


[BUG?] How to handle special characters or escape them on spark version 3.3.0?

2023-01-04 Thread Vieira, Thiago
Hello everyone,

I've already raised this question on Stack Overflow, but to be honest I truly
believe this is a bug in the new Spark version, so I am also sending this email.

Previously I was using Spark version 3.2.1 to read data from an SAP database via
the JDBC connector, and I had no issues performing the following steps:

df_1 = spark.read.format("jdbc") \
    .option("url", "URL_LINK") \
    .option("dbtable", 'DATABASE."/ABC/TABLE"') \
    .option("user", "USER_HERE") \
    .option("password", "PW_HERE") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .load()

display(df_1)

df_2 = df_1.filter("`/ABC/COLUMN` = 'ID_HERE'")

display(df_2)


The code above runs as it should, returning the expected rows.

Since I updated my Spark version to 3.3.0, because I need the new
'availableNow' trigger (a trigger for the streaming process), the process above
has started to fail; it does not run at all.

Please find the error message below.


---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
<command> in <module>()
      1 df_2 = df_1.filter("`/ABC/COLUMN` = 'ID_HERE'")
      2
----> 3 display(df_2)

/databricks/python_shell/dbruntime/display.py in display(self, input, *args, **kwargs)
     81                 raise Exception('Triggers can only be set for streaming queries.')
     82
---> 83             self.add_custom_display_data("table", input._jdf)
     84
     85         elif isinstance(input, list):

/databricks/python_shell/dbruntime/display.py in add_custom_display_data(self, data_type, data)
     34     def add_custom_display_data(self, data_type, data):
     35         custom_display_key = str(uuid.uuid4())
---> 36         return_code = self.entry_point.addCustomDisplayData(custom_display_key, data_type, data)
     37         ip_display({
     38             "application/vnd.databricks.v1+display": custom_display_key,

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    200                 # Hide where the exception came from that shows a non-Pythonic
    201                 # JVM exception message.
--> 202                 raise converted from None
    203             else:
    204                 raise

ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/'(line 1, pos 0)

== SQL ==
/ABC/COLUMN
^^^

I've already tried to format it in many different ways, following the
instructions at https://spark.apache.org/docs/latest/sql-ref-literals.html.
I've also tried formatting the string beforehand with string functions, and
tried a raw string, but nothing seems to work as expected.

Another important piece of information: I tried to create dummy code so you
could replicate the issue, but when I create those tables with slashes
('/ABC/TABLE') containing columns with slashes ('/ABC/COLUMN') directly in
PySpark, instead of using the JDBC connector, it actually works and I am able
to filter. So I believe this error is related to SQL / JDBC; I am no longer
able to escape special characters on Spark 3.3.0.
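
For what it's worth, a possible workaround (a sketch only, not verified against the SAP/JDBC source) is to build the predicate with the Column API instead of a SQL string, so the column name never goes through the SQL parser:

# Reference the column through the DataFrame (or pyspark.sql.functions.col)
# so the leading '/' is treated as part of the name, not parsed as SQL.
df_2 = df_1.filter(df_1["/ABC/COLUMN"] == "ID_HERE")
display(df_2)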


Regards,

Thiago Vieira
Data Engineer



Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
Hey guys, much appreciate your quick responses.

To answer your questions,
@Mich Talebzadeh We get data from multiple 
sources, and we don't have any control over what they put in. In this case the 
column is supposed to contain some feedback and it can also contain quoted 
strings.

@Sean Owen Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"
Here, if we don't set " as the escape character:

df = spark.read.option("multiLine", True).option("enforceSchema", 
False).option("header", True).csv(f"/tmp/test.csv")

df.show(100, False)

+---+----+--------------------+
|a  |b   |c                   |
+---+----+--------------------+
|1  |null|",see what ""I did""|
+---+----+--------------------+

df.count()

1

So, we put in " as the escape character, and then it's parsed fine but the count
is wrong.


df = spark.read.option("escape", '"').option("multiLine", 
True).option("enforceSchema", False).option("header", 
True).csv(f"/tmp/test.csv")

df.show(100, False)

+---+----+------------------+
|a  |b   |c                 |
+---+----+------------------+
|1  |null|,see what "I did",|
|2  |null|abc               |
+---+----+------------------+

df.count()
1

I understand it's a complex case, or maybe an edge case, which makes it difficult
for Spark to understand where a column ends, even though we have enabled
multiLine=True.

See another example below, which even has a multi-line value for column c.

"a","b","c"
"1","",",see what ""I did"",
i am still writing"
"2","","abc"

# with escape

df = spark.read.option("escape", '"').option("multiLine", 
True).option("enforceSchema", False).option("header", 
True).csv(f"/tmp/test.csv")

df.show(10, False)

+---+----+--------------------------------------+
|a  |b   |c                                     |
+---+----+--------------------------------------+
|1  |null|,see what "I did",\ni am still writing|
|2  |null|abc                                   |
+---+----+--------------------------------------+

df.count()
1

df.select("c").show(10, False)
+------------------+
|c                 |
+------------------+
|see what ""I did""|
|null              |
|abc               |
+------------------+

# without escape "


df.show(10, False)

+-------------------+----+--------------------+
|a                  |b   |c                   |
+-------------------+----+--------------------+
|1                  |null|",see what ""I did""|
|i am still writing"|null|null                |
|2                  |null|abc                 |
+-------------------+----+--------------------+

df.select("c").show(10, False)

+--------------------+
|c                   |
+--------------------+
|",see what ""I did""|
|null                |
|abc                 |
+--------------------+


The issue is that it can print the complete data frame correctly with escape
enabled, but when you select a column or ask for a count it gives the wrong
output.
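
As a side note, one more CSV reader knob that may be worth trying (I have not verified it against this exact file, so treat it as a guess) is unescapedQuoteHandling, which Spark 3 exposes from the underlying univocity parser:

# Sketch only: combine the doubled-quote escape with explicit handling of stray
# quotes inside quoted values; other accepted values include BACK_TO_DELIMITER,
# STOP_AT_DELIMITER, SKIP_VALUE and RAISE_ERROR.
df = (spark.read
      .option("multiLine", True)
      .option("header", True)
      .option("escape", '"')
      .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
      .csv("/tmp/test.csv"))
df.show(truncate=False)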


Regards
Saurabh

From: Mich Talebzadeh 
Sent: 04 January 2023 10:14
To: Sean Owen 
Cc: Saurabh Gulati ; User 
Subject: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the 
data

What is the point of having "," as a column value? From a business point of view
it does not signify anything IMO.




 
   view my Linkedin profile


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 3 Jan 2023 at 20:39, Sean Owen <sro...@gmail.com> wrote:
Why does the data even need cleaning? That's all perfectly correct. The error 
was setting quote to be an escape char.

On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
if you take your source CSV as below


"a","b","c"
"1","",","
"2","","abc"


and define your code as below


csv_file="hdfs://rhes75:9000/data/stg/test/testcsv.csv"
# read hive table in spark
listing_df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load(csv_file)

Re: How to set a config for a single query?

2023-01-04 Thread Shay Elbaz
Hi Felipe,

I had the same problem - needed to execute multiple jobs/actions multithreaded, 
with slightly different sql configs per job (mainly 
spark.sql.shuffle.partitions). I'm not sure if this is the best solution, but I 
ended up using newSession() per thread. It works well, except that the new
SparkSession does not contain custom configurations from the original session.
I had to re-apply the important configurations (catalogs, etc.) on the new 
Sessions as well. Hope that helps.
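
A rough sketch of that pattern (PySpark-flavoured; the Scala API is analogous, and the config value and queries below are only illustrative):

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_query(sql_text, shuffle_partitions):
    # newSession() shares the SparkContext and cached tables with the parent,
    # but has its own SQL conf (and its own temp views/UDFs), so setting a
    # value here does not leak into queries running on other threads.
    session = spark.newSession()
    session.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
    return session.sql(sql_text).collect()

with ThreadPoolExecutor(max_workers=2) as pool:
    a = pool.submit(run_query, "SELECT 1 AS x", 8)
    b = pool.submit(run_query, "SELECT 2 AS x", 200)
    print(a.result(), b.result())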

Shay

From: Saurabh Gulati 
Sent: Wednesday, January 4, 2023 11:54 AM
To: Felipe Pessoto ; user@spark.apache.org 

Subject: [EXTERNAL] Re: How to set a config for a single query?


Hey Felipe,
Since you are collecting the dataframes, you might as well run them separately 
with desired configs and store them in your storage.

Regards
Saurabh

From: Felipe Pessoto 
Sent: 04 January 2023 01:14
To: user@spark.apache.org 
Subject: [EXTERNAL] How to set a config for a single query?


Hi,



In Scala is it possible to set a config value to a single query?



I could set/unset the value, but it won’t work for multithreading scenarios.



Example:



spark.sql.adaptive.coalescePartitions.enabled = false

queryA_df.collect

spark.sql.adaptive.coalescePartitions.enabled=original value

queryB_df.collect

queryC_df.collect

queryD_df.collect





If I execute that block of code multiple times using multiple threads, I can end
up executing Query A with coalescePartitions.enabled=true, and Queries B, C and 
D with the config set to false, because another thread could set it between the 
executions.



Is there any good alternative to this?



Thanks.


Re: How to set a config for a single query?

2023-01-04 Thread Saurabh Gulati
Hey Felipe,
Since you are collecting the dataframes, you might as well run them separately 
with desired configs and store them in your storage.

Regards
Saurabh

From: Felipe Pessoto 
Sent: 04 January 2023 01:14
To: user@spark.apache.org 
Subject: [EXTERNAL] How to set a config for a single query?


Hi,



In Scala is it possible to set a config value to a single query?



I could set/unset the value, but it won’t work for multithreading scenarios.



Example:



spark.sql.adaptive.coalescePartitions.enabled = false

queryA_df.collect

spark.sql.adaptive.coalescePartitions.enabled=original value

queryB_df.collect

queryC_df.collect

queryD_df.collect





If I execute that block of code multiple times using multiple threads, I can end
up executing Query A with coalescePartitions.enabled=true, and Queries B, C and 
D with the config set to false, because another thread could set it between the 
executions.



Is there any good alternative to this?



Thanks.


Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Mich Talebzadeh
What is the point of having "," as a column value? From a business point
of view it does not signify anything IMO.



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Jan 2023 at 20:39, Sean Owen  wrote:

> Why does the data even need cleaning? That's all perfectly correct. The
> error was setting quote to be an escape char.
>
> On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh 
> wrote:
>
>> if you take your source CSV as below
>>
>> "a","b","c"
>> "1","",","
>> "2","","abc"
>>
>>
>> and define your code as below
>>
>>
>>csv_file="hdfs://rhes75:9000/data/stg/test/testcsv.csv"
>> # read hive table in spark
>> listing_df =
>> spark.read.format("com.databricks.spark.csv").option("inferSchema",
>> "true").option("header", "true").load(csv_file)
>> listing_df.printSchema()
>> print(f"""\n Reading from Hive table {csv_file}\n""")
>> listing_df.show(100,False)
>> listing_df.select("c").show()
>>
>>
>> results in
>>
>>
>>  Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv
>>
>> +---+----+---+
>> |a  |b   |c  |
>> +---+----+---+
>> |1  |null|,  |
>> |2  |null|abc|
>> +---+----+---+
>>
>> +---+
>> |  c|
>> +---+
>> |  ,|
>> |abc|
>> +---+
>>
>>
>> which assumes that "," is a value for column c in row 1
>>
>>
>> This interpretation is correct. You ought to do data cleansing before.
>>
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 3 Jan 2023 at 17:03, Sean Owen  wrote:
>>
>>> No, you've set the escape character to double-quote, when it looks like
>>> you mean for it to be the quote character (which it already is). Remove
>>> this setting, as it's incorrect.
>>>
>>> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati
>>>  wrote:
>>>
 Hello,
 We are seeing a case with csv data when it parses csv data incorrectly.
 The issue can be replicated using the below csv data

 "a","b","c"
 "1","",","
 "2","","abc"

 and using the spark csv read command.

 df = spark.read.format("csv")\
 .option("multiLine", True)\
 .option("escape", '"')\
 .option("enforceSchema", False) \
 .option("header", True)\
 .load(f"/tmp/test.csv")

 df.show(100, False) # prints both rows
 +---+----+---+
 |a  |b   |c  |
 +---+----+---+
 |1  |null|,  |
 |2  |null|abc|
 +---+----+---+

 df.select("c").show() # merges last column of first row and first
 column of second row
 +--+
 | c|
 +--+
 |"\n"2"|

 print(df.count()) # prints 1, should be 2


 It feels like a bug and I thought of asking the community before
 creating a bug on jira.

 Mvg/Regards
 Saurabh