
Traceback is missing content in pyspark when invoked with UDF

2024-05-01 Thread Indivar Mishra
Hi,

*TL;DR:* I have a scenario where I generate a code string on the fly and
execute it. When an error occurs I need the traceback, but for exec'd code
I only get a partial traceback, i.e. the content of the line that caused
the error is missing.

Consider the minimal reproducible example below:
def fun():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("some_name").getOrCreate()

    columns = ["Seqno", "Name"]
    data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]

    df = spark.createDataFrame(data=data, schema=columns)

    def errror_func(str):
        def internal_error_method():
            raise RuntimeError

        return internal_error_method()

    # Converting function to UDF
    errror_func_udf = udf(lambda z: errror_func(z), StringType())

    df.select(col("Seqno"), errror_func_udf(col("Name")).alias("Name")).show(truncate=False)

fun()


This gives the traceback shown below. (Notice that we also get the content
of the line that caused the error.)

> Traceback (most recent call last):
>   File "temp.py", line 28, in <module>
>     fun()
>   File "temp.py", line 25, in fun
>     df.select(col("Seqno"), errror_func_udf(col("Name")).alias("Name")).show(truncate=False)
>   File "/home/indivar/corridor/code/corridor-platforms/venv/lib/python3.8/site-packages/pyspark/sql/dataframe.py", line 502, in show
>     print(self._jdf.showString(n, int_truncate, vertical))
>   File "/home/indivar/corridor/code/corridor-platforms/venv/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
>     return_value = get_return_value(
>   File "/home/indivar/corridor/code/corridor-platforms/venv/lib/python3.8/site-packages/pyspark/sql/utils.py", line 117, in deco
>     raise converted from None
> pyspark.sql.utils.PythonException:
>   An exception was thrown from the Python worker. Please see the stack trace below.
> Traceback (most recent call last):
>   File "temp.py", line 23, in <lambda>
>     errror_func_udf = udf(lambda z: errror_func(z), StringType())
>   File "temp.py", line 20, in errror_func
>     return internal_error_method()
>   File "temp.py", line 18, in internal_error_method
>     raise RuntimeError
> RuntimeError
>
But now if I run the same code via exec, I lose the line content in the
traceback, although the line numbers are still there:
import linecache

code = """
def fun():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("some_name").getOrCreate()

    columns = ["Seqno", "Name"]
    data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]

    df = spark.createDataFrame(data=data, schema=columns)

    def errror_func(str):
        def internal_error_method():
            raise RuntimeError

        return internal_error_method()

    # Converting function to UDF
    errror_func_udf = udf(lambda z: errror_func(z), StringType())

    df.select(col("Seqno"), errror_func_udf(col("Name")).alias("Name")).show(truncate=False)
"""


scope = {}
filename = ""
compiled_code = compile(code, filename, "exec")
if filename not in linecache.cache:
    linecache.cache[filename] = (
        len(scope),
        None,
        code.splitlines(keepends=True),
        filename,
    )
exec(compiled_code, scope, scope)
fun = scope["fun"]

fun()


The traceback of this code is:

> Traceback (most recent call last):
>   File "temp.py", line 74, in <module>
>     fun()
>   File "", line 23, in fun
>   File "/home/indivar/corridor/code/corridor-platforms/venv/lib/python3.8/site-packages/pyspark/sql/dataframe.py", line 502, in show
>     print(self._jdf.showString(n, int_truncate, vertical))
>   File "/home/indivar/corridor/code/corridor-platforms/venv/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
>     return_value = get_return_value(
>   File "/home/indivar/corridor/code/corridor-platforms/venv/lib/python3.8/site-packages/pyspark/sql/utils.py", line 117, in deco
>     raise converted from None
> pyspark.sql.utils.PythonException:
>   An exception was thrown from the Python worker. Please see the stack trace below.
> Traceback (most recent call last):
>   File "", line 21, in <lambda>
>   File "", line 18, in errror_func
>   File "", line 16, in internal_error_method
> RuntimeError

As you can see, this traceback is missing the line content.

Initially I thought this was a Python issue, so I did some reading. Python
internally uses the linecache module to fetch line content, and for exec'd
code Python itself had the same problem up to 3.12; it was fixed in Python
3.13 (see for details: Support multi-line error locations in traceback and
other related improvements (PEP-657, 3.11) · Issue #106922 · python/cpython
(github.com)).
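For what it's worth, the linecache registration trick above does make line
content appear for exceptions raised and formatted in the same process. A
minimal, Spark-free sketch (the snippet and the pseudo-filename
"<generated_snippet>" are hypothetical):

```python
import linecache
import traceback

# A dynamically generated snippet (hypothetical example).
code = 'def boom():\n    raise RuntimeError("from generated code")\n'

filename = "<generated_snippet>"  # hypothetical pseudo-filename
compiled = compile(code, filename, "exec")

# Register the source with linecache so the traceback machinery can map
# line numbers back to line content. mtime=None keeps the entry from
# being purged by linecache.checkcache().
linecache.cache[filename] = (len(code), None, code.splitlines(keepends=True), filename)

scope = {}
exec(compiled, scope, scope)

try:
    scope["boom"]()
except RuntimeError:
    tb = traceback.format_exc()

# The frame for "<generated_snippet>" now includes the source line.
print(tb)
```

This only registers the source in the current process, though. The second
traceback in the PySpark case is formatted inside the Python worker
process, where no such registration has happened, which would explain why
the worker-side frames still lack line content (an assumption based on the
behavior shown above).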

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve,

Thanks for your statement. I tend to use uuid myself to avoid collisions.
This built-in function generates random IDs that are highly likely to be
unique across systems. My concern is at the edges, so to speak: if the
Spark application runs for a very long time or encounters restarts, the
monotonically_increasing_id() sequence might restart from the beginning.
This could again cause duplicate IDs if other Spark applications are
running concurrently or if data is processed across multiple runs of the
same application.
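As a sketch of what I mean (plain Python for illustration; the sample rows
are hypothetical):

```python
import uuid

# Generate one row ID per record with uuid4: random 122-bit IDs, so
# collisions are negligible even across hosts, restarts, and concurrent
# applications.
rows = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]
ids = [str(uuid.uuid4()) for _ in rows]

# Unlike a restarted monotonically_increasing_id() sequence, these do not
# repeat across runs.
assert len(set(ids)) == len(ids)
print(ids)
```

In PySpark this would typically be done with the built-in uuid() SQL
expression, e.g. df.withColumn("id", F.expr("uuid()")), or a udf wrapping
uuid.uuid4(). One caveat either way: the expression is non-deterministic,
so the ID column should be materialized (e.g. written out or checkpointed)
before it is relied upon, since recomputation of the plan would regenerate
different IDs.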

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Wed, 1 May 2024 at 01:22, Stephen Coy  wrote:

> Hi Mich,
>
> I was just reading random questions on the user list when I noticed that
> you said:
>
> On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh 
> wrote:
>
> 1) You are using monotonically_increasing_id(), which is not
> collision-resistant in distributed environments like Spark. Multiple hosts
> can generate the same ID. I suggest switching to UUIDs (e.g.,
> uuid.uuid4()) for guaranteed uniqueness.
>
>
> It’s my understanding that the *Spark* `monotonically_increasing_id()`
> function exists for the exact purpose of generating a collision-resistant
> unique id across nodes on different hosts.
> We use it extensively for this purpose and have never encountered an issue.
>
> Are we wrong or are you thinking of a different (not Spark) function?
>
> Cheers,
>
> Steve C
>
>
>
>
> This email contains confidential information of and is the copyright of
> Infomedia. It must not be forwarded, amended or disclosed without consent
> of the sender. If you received this message by mistake, please advise the
> sender and delete all copies. Security of transmission on the internet
> cannot be guaranteed, could be infected, intercepted, or corrupted and you
> should ensure you have suitable antivirus protection in place. By sending
> us your or any third party personal details, you consent to (or confirm you
> have obtained consent from such third parties) to Infomedia’s privacy
> policy. http://www.infomedia.com.au/privacy-policy/
>