Thanks a lot, Roman.
But the provided link has several ways to deal with the problem.
Why do we need to operate on an RDD instead of a DataFrame/Dataset?
Do I need a custom partitioner in my case, and how do I invoke it in
spark-sql?
Can anyone provide a sample of handling skewed data with spark-sql?
Thanks,
Hi JF,
Yes, first we should know the actual number of partitions the dataframe has
and the record count in each. Based on that, we should try to spread the data
evenly across all partitions.
It is always better to have num partitions = N * num executors.
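For example, a minimal sketch of checking this (assuming a DataFrame named df
and a hypothetical executor count):

// Inspect the partition count and per-partition record counts
println(s"Partitions: ${df.rdd.getNumPartitions}")
df.rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
  .foreach { case (i, n) => println(s"partition $i: $n records") }
// Rebalance to num partitions = N * num executors, e.g. N = 4:
val numExecutors = 10 // assumption: set to your actual executor count
val balanced = df.repartition(4 * numExecutors)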
"But the sequence of columns in partitionBy decides the
It would be better if you shared a code block, so we can understand it
better; otherwise it would be difficult to provide an answer.
~Shyam
On Wed, Mar 6, 2019 at 8:38 AM JF Chen wrote:
> When my kafka executor reads data from kafka, sometimes it throws the
> error "java.lang.AssertionError: assertion failed:
Hi,
To better debug the issue, please check the below config properties:
- max.partition.fetch.bytes in the Spark Kafka consumer; if it is not set on
the consumer, the global config at the broker level applies.
- spark.streaming.kafka.consumer.poll.ms
- spark.network.timeout (used as the fallback if the above is not set)
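A minimal sketch of where these would be set (the values and broker address
are placeholders, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.kafka.consumer.poll.ms", "120000") // assumed 2-minute poll timeout
  .set("spark.network.timeout", "300s") // fallback when the poll timeout is unset

// Consumer-level setting, passed in the Kafka params of the direct stream:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092", // hypothetical broker
  "max.partition.fetch.bytes" -> (10485760: java.lang.Integer) // 10 MiB per partition fetch
)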
Hi Nuthan, I have had the same issue before.
As a shortcut I created a text file called imports and then loaded its
content with the :load command of the Scala REPL.
Example:
Create "imports" textfile with the import instructions you need
import org.apache.hadoop.conf.Configuration
import
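For illustration, the full workflow might look like this (the file contents
beyond the first import are an assumption, not from the thread):

$ cat imports
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

scala> :load imports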
Hi,
When launching the REPL using spark-submit, the following are loaded
automatically.
scala> :imports
1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
I'm trying to run a C++ program on a Spark cluster using the rdd.pipe()
operation, but the executors throw: java.lang.IllegalStateException:
Subprocess exited with status 132.
The Spark jar runs totally fine in standalone mode, and the C++ program runs
just fine on its own as well. I've tried with
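For reference, exit status 132 is 128 + 4 (SIGILL): the subprocess died on an
illegal CPU instruction, which often means the binary was compiled for a
different CPU than the executor hosts. A minimal sketch of the call in
question (path and data are hypothetical):

// Each RDD element is written to the subprocess's stdin, one per line;
// the subprocess's stdout lines become the elements of the result RDD.
val out = sc.parallelize(Seq("1", "2", "3")).pipe("/path/to/cpp_binary")
out.collect().foreach(println)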
Hi All,
I need to save a huge data frame as a parquet file. As it is huge, it is
taking several hours. To improve performance, I understand I should write it
group-wise.
But when I do partitionBy(columns*) / groupBy(columns*), the driver spills a
lot of data and performance suffers badly again.
So how
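A common pattern here (a sketch with assumed partition columns, not
necessarily the poster's schema) is to repartition on the partition columns
before writing, so each task writes whole groups:

import org.apache.spark.sql.functions.col

df.repartition(col("year"), col("month")) // co-locate each group in one task
  .write
  .partitionBy("year", "month") // directory layout: year=.../month=...
  .parquet("/path/to/output")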
Ok, thanks. It does solve the problem.
Nuthan Reddy
Sigmoid Analytics
On Tue, Mar 5, 2019 at 5:17 PM Jesús Vásquez
wrote:
> Hi Nuthan, I have had the same issue before.
> As a shortcut I created a text file called imports and then loaded its
> content with the :load command of the Scala REPL.
Hi Kant,
You can try "explaining" the sql query.
spark.sql(sqlText).explain(true); //the parameter true is to get more
verbose query plan and it is optional.
This is the safest way to validate sql without actually executing/creating
a df/view in spark. It validates syntax as well as schema of
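A minimal sketch of wrapping this for validation (the exception types are
Spark's standard ones; the helper name is made up):

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.parser.ParseException

def validateSql(sqlText: String): Option[String] =
  try {
    spark.sql(sqlText).explain(true) // parses and analyzes, runs no job
    None
  } catch {
    case e: ParseException    => Some(s"Syntax error: ${e.getMessage}")
    case e: AnalysisException => Some(s"Analysis error: ${e.getMessage}")
  }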
As I understand, the Apache Spark Master can be run in high-availability mode
using Zookeeper. That is, multiple Spark masters can run in Leader/Follower
mode, and these masters are registered with Zookeeper.
In our scenario, Zookeeper is expiring the session of the Spark Master that
is acting as Leader. So
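For context, Master HA is typically enabled through SPARK_DAEMON_JAVA_OPTS in
spark-env.sh; a sketch (the ZooKeeper quorum below is a placeholder):

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"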
I am using PySpark through JupyterLab with the Spark distribution provided
by *conda install pyspark*, so I run Spark locally.
I started with pyspark 2.4.0 but hit a socket issue, which I solved by
downgrading the package to 2.3.2.
So I am using pyspark 2.3.2 at the moment.
I am trying to
Hi Akshay,
Thanks for this. I will give it a try. The Java API for .explain returns
void and doesn't throw any checked exception, so I guess I have to catch the
generic RuntimeException and walk through the stack trace to see if there is
a ParseException. In short, the code just gets really
Hi Team,
I have created a JIRA ticket to expose port 4040 on the driver service, and
I have added the details in the ticket.
I am aware of the changes that we need to make and want to assign this task
to myself.
I am not able to assign it to myself. Can someone help me assign this task
to me?
Thank
Thank you for the clarification.
On Tue, Mar 5, 2019 at 11:59 PM Gabor Somogyi
wrote:
> Hi,
>
> It will be automatically assigned when one creates a PR.
>
> BR,
> G
>
>
> On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar wrote:
>
>> My Jira username is: *cckavar*
>>
>> On Tue, Mar 5, 2019 at 11:46
Hi,
It will be automatically assigned when one creates a PR.
BR,
G
On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar wrote:
> My Jira username is: *cckavar*
>
> On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar wrote:
>
>> Hi Team,
>> I have created a JIRA ticket to expose port 4040 on the driver service.
My Jira username is: *cckavar*
On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar wrote:
> Hi Team,
> I have created a JIRA ticket to expose port 4040 on the driver service. I
> have also added the details in the ticket.
>
> I am aware of the changes that we need to make and want to assign this
> task to my
Hi Huon
Good catch. A 64-bit hash is definitely a useful function.
> the birthday paradox implies a >50% chance of at least one collision for
> tables larger than 77000 rows
Do you know how many rows are needed for a 50% chance with a 64-bit hash?
About the seed column, to me there is no need for such an
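For reference, the standard birthday-bound arithmetic (my own calculation,
not from the thread): a 50% collision chance for a b-bit hash occurs around
n ≈ sqrt(2 ln 2 * 2^b) ≈ 1.1774 * 2^(b/2). For b = 32 that is about
1.1774 * 65536 ≈ 77000 rows, matching the figure above; for b = 64 it is
about 1.1774 * 2^32 ≈ 5.06 billion rows.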
Hi Nicolas,
On 6/3/19, 7:48 am, "Nicolas Paris" wrote:
> Hi Huon
> Good catch. A 64-bit hash is definitely a useful function.
> > the birthday paradox implies a >50% chance of at least one collision
> > for tables larger than 77000 rows
> Do you know how many rows are needed for a 50% chance
Hi Shyam
Thanks for your reply.
You mean that, after knowing the partition counts of column_a, column_b, and
column_c, the sequence of columns in partitionBy should match the order of
the partition counts of columns a, b, and c?
But the sequence of columns in partitionBy decides the
directory hierarchy
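For illustration (column names assumed), the partitionBy order maps directly
to the directory nesting:

df.write
  .partitionBy("column_a", "column_b")
  .parquet("/out")
// layout: /out/column_a=<v>/column_b=<v>/part-*.parquet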
When my kafka executor reads data from kafka, sometimes it throws the error
"java.lang.AssertionError: assertion failed: Failed to get records for
after polling for 18", which happens after 3 minutes of executing.
The data waiting to be read is not so huge, only about 1GB. And the other
partitions