Re: UnspecifiedDistribution Error using AQE

2021-08-03 Thread Mich Talebzadeh
Hi,

There have been reports of errors arising when the following is set:

spark.conf.set("spark.sql.adaptive.enabled", "true")

Some were reported in this forum. Please search the mailing list for
spark.sql.adaptive.enabled; since you have not specified the nature of the
query that fails when this configuration parameter is turned on, the
archives may give you additional information.
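
If this is blocking you, the simplest stopgap (given that, as you note, the
error does not occur when the flag is off) is to disable it for that job:

# grounded in the report: the failure does not occur with AQE disabled
spark.conf.set("spark.sql.adaptive.enabled", "false")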

HTH








*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






UnspecifiedDistribution Error using AQE

2021-08-03 Thread Jesse Lord
Hello Spark users,

I have an error that I would like to report as a Spark 3.1.1 bug, but I do
not know how to create a reproducible example. I can provide a full stack
trace if desired, but the most useful information seems to be:

E   py4j.protocol.Py4JJavaError: An error occurred while calling o3301.toJavaRDD.
E   : java.lang.IllegalStateException: UnspecifiedDistribution does not have default partitioning.
E   at org.apache.spark.sql.catalyst.plans.physical.UnspecifiedDistribution$.createPartitioning(partitioning.scala:52)
E   at org.apache.spark.sql.execution.exchange.EnsureRequirements$.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:54)

This error happens when I have spark.sql.adaptive.enabled=true but does not
happen when I change it to false. It happens both for one of my unit tests
(~30 rows) and with production data. Another workaround is to cache the
dataframe before calling the collect/toJSON statement, as sketched below.
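
In outline (the query shown is only a placeholder, since I have not been
able to reduce the real one to a reproducible example):

df = spark.table("some_table").groupBy("key").count()  # placeholder query
df.cache()                      # cache before the action, per the workaround
result = df.toJSON().collect()  # succeeds once the DataFrame is cached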

I was not able to find any information about this kind of error in the
Spark Jira or on Stack Exchange. I was wondering if anyone has seen this
error before in relation to AQE and has any suggestions for how to report it.

Thanks,
Jesse



Re: Unsubscribe

2021-08-03 Thread Howard Yang
Unsubscribe

Edward Wu wrote on Tuesday, 3 August 2021 at 4:15 PM:

> Unsubscribe
>


Unsubscribe

2021-08-03 Thread Edward Wu
Unsubscribe


Re: Reading the last line of each file in a set of text files

2021-08-03 Thread Artemis User
Assuming you are running Linux, an easy option would be to use the Linux
tail command to extract the last line (or last couple of lines) of each
file and save it to a different file/directory before feeding it to Spark.
It shouldn't be hard to write a shell script that runs tail on all files in
a directory (or an S3 bucket, if using the AWS CLI); a rough Python
equivalent of that preprocessing step is sketched below. If you really want
this kind of file preprocessing done in Spark, you will have to extend
Spark's DataFrameReader API, which may not be an easy task if you don't
have experienced Scala developers.  Hope this helps...
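
A minimal sketch of that preprocessing in Python (assuming local plain-text
files; the paths are placeholders, and S3 objects would need ranged reads
rather than seek):

import glob

def last_line(path, chunk_size=4096):
    # Read backwards from the end of the file in fixed-size chunks,
    # stopping once the final complete line is in the buffer.
    with open(path, "rb") as f:
        f.seek(0, 2)                     # jump to end of file
        pos = f.tell()
        buf = b""
        while pos > 0 and buf.count(b"\n") < 2:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf
        lines = [ln for ln in buf.split(b"\n") if ln.strip()]
        return lines[-1].decode("utf-8") if lines else ""

# Write just the last lines to a small side file that Spark then reads.
with open("/tmp/last_lines.txt", "w") as out:
    for path in glob.glob("/data/input/*.txt"):
        out.write(last_line(path) + "\n")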


-- ND

On 8/2/21 6:50 PM, Sayeh Roshan wrote:

Hi users,
Does anyone here have experience writing Spark code that reads just the
last line of each text file in a directory, S3 bucket, etc.?
I am looking for a solution that doesn't require reading the whole file. I
basically wonder whether you can create a DataFrame/RDD using file seek.
Not sure whether there is such a thing already available in Spark.

Thank you very much in advance.



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Collecting list of errors across executors

2021-08-03 Thread Abdeali Kothari
You could create a custom accumulator using a linked list or similar; see
the sketch after the links below.

Some examples that could help:
https://towardsdatascience.com/custom-pyspark-accumulators-310f63ca3c8c
https://stackoverflow.com/questions/34798578/how-to-create-custom-list-accumulator-i-e-listint-int
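
For instance, a minimal PySpark sketch (ListParam and the failing condition
are illustrative, not taken from the posts above):

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    # Merges per-executor lists into a single list on the driver.
    def zero(self, initial_value):
        return []
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

sc = SparkContext.getOrCreate()
errors = sc.accumulator([], ListParam())

def do_something(x):
    try:
        if x % 3 == 0:                  # stand-in for the real work
            raise ValueError("bad record: %d" % x)
    except Exception as e:
        errors.add([str(e)])            # wrap in a list so extend() applies

sc.parallelize(range(10)).foreach(do_something)
print(errors.value)                     # the merged list, on the driver only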




Collecting list of errors across executors

2021-08-03 Thread Sachit Murarka
Hi Team,

We are using rdd.foreach(lambda x: do_something(x)).

Our use case requires collecting into a list the error messages that come
up in the exception block of the method do_something.
Since this runs on the executors, a global list won't work here. As the
state needs to be shared among the executors, I thought of using an
Accumulator, but the built-in accumulator supports only integral values.

Can someone please suggest how I can collect into a list all the errors
coming from all records of the RDD?

Thanks,
Sachit Murarka