Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread Akshay Bhardwaj
Hi, To better debug the issue, please check the below config properties: - max.partition.fetch.bytes within the spark kafka consumer. If not set for the consumer, the global config at broker level applies. - spark.streaming.kafka.consumer.poll.ms - spark.network.timeout (If the above is not
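A rough sketch of where these knobs are typically set, assuming the direct stream from spark-streaming-kafka-0-10; the broker addresses, group id and sizes below are placeholders, not values from the thread:

    // Values are illustrative only; tune them to your environment.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.kafka.consumer.poll.ms", "120000") // how long the cached consumer polls before the assertion fails
      .set("spark.network.timeout", "300s")

    // Kafka consumer properties passed to the direct stream
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092",   // placeholder
      "group.id" -> "my-group",                              // placeholder
      "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "max.partition.fetch.bytes" -> "10485760"               // 10 MB per partition per fetch
    )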

Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread Shyam P
It would be better if you share some code block so we can understand it better; otherwise it would be difficult to provide an answer. ~Shyam On Wed, Mar 6, 2019 at 8:38 AM JF Chen wrote: > When my kafka executor reads data from kafka, sometimes it throws the > error "java.lang.AssertionError: assertion failed:

Re: spark df.write.partitionBy run very slow

2019-03-05 Thread Shyam P
Hi JF, Yes, first we should know the actual number of partitions the dataframe has and its count of records. Accordingly we should try to have data evenly in all partitions. It is always better to have Num of partitions = N * Num of executors. "But the sequence of columns in partitionBy decides the
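A small sketch of that advice (the multiplier and executor count below are made-up numbers, not from the thread):

    // Check how the data is currently laid out.
    println(df.rdd.getNumPartitions)          // current number of partitions
    df.groupBy("column_a").count().show()     // per-key counts, to spot skew

    // Aim for roughly N * number-of-executors partitions before the expensive write.
    val numExecutors = 10                     // e.g. what you pass as --num-executors
    val n = 3                                 // small multiplier
    val evenDf = df.repartition(n * numExecutors)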

Re: How to group dataframe year-wise and iterate through groups and send each year to dataframe to executor?

2019-03-05 Thread Shyam P
Thanks a lot Roman. But the provided link has several ways to deal with the problem. Why do we need to do the operation on an RDD instead of a dataframe/dataset? Do I need a custom partitioner in my case, and how do I invoke it in spark-sql? Can anyone provide some sample on handling skewed data with spark-sql? Thanks,
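Not an answer from the thread, but one common way to handle a skewed key with the DataFrame API is salting: spread the hot key over a few buckets, aggregate per (key, salt), then aggregate again without the salt. The column names below (skewed_key, value) are placeholders:

    import org.apache.spark.sql.functions._

    // Add a random salt in [0, 16) to break one hot key into 16 buckets.
    val salted  = df.withColumn("salt", (rand() * 16).cast("int"))
    val partial = salted.groupBy(col("skewed_key"), col("salt"))
                        .agg(sum("value").as("partial_sum"))
    val result  = partial.groupBy("skewed_key")
                         .agg(sum("partial_sum").as("total"))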

"java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread JF Chen
When my kafka executor reads data from kafka, sometimes it throws the error "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000", which is after 3 minutes of executing. The data waiting to be read is not so huge, about 1GB. And other partitions

Re: spark df.write.partitionBy run very slow

2019-03-05 Thread JF Chen
Hi Shyam, Thanks for your reply. You mean after knowing the partition number of column_a, column_b, column_c, the sequence of columns in partitionBy should be the same as the order of the partition numbers of columns a, b and c? But the sequence of columns in partitionBy decides the directory hierarchy
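For reference, the column order only changes how the output directories are nested; a tiny illustration (the output path is a placeholder):

    df.write.partitionBy("column_a", "column_b", "column_c").parquet("/out")
    // -> /out/column_a=.../column_b=.../column_c=.../part-*.parquet
    // whereas partitionBy("column_c", "column_a", "column_b") nests column_c first:
    // -> /out/column_c=.../column_a=.../column_b=.../part-*.parquet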

Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Huon.Wilson
Hi Nicolas, On 6/3/19, 7:48 am, "Nicolas Paris" wrote: Hi Huon Good catch. A 64 bit hash is definitely a useful function. > the birthday paradox implies >50% chance of at least one for tables larger than 77000 rows Do you know how many rows to have 50% chances

Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Nicolas Paris
Hi Huon Good catch. A 64 bit hash is definitely a useful function. > the birthday paradox implies >50% chance of at least one for tables larger > than 77000 rows Do you know how many rows it takes to have a 50% chance for a 64 bit hash? About the seed column, to me there is no need for such an
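Not an answer from the thread, but the usual birthday approximation (the same one that gives roughly 77,000 rows for a 32-bit hash) can be evaluated directly:

    // n ~ sqrt(2 * N * ln 2) rows for a ~50% chance of at least one collision.
    val n32 = math.sqrt(2 * math.pow(2, 32) * math.log(2))   // ~ 7.7e4 rows
    val n64 = math.sqrt(2 * math.pow(2, 64) * math.log(2))   // ~ 5.1e9 rows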

Re: Is there a way to validate the syntax of raw spark sql query?

2019-03-05 Thread kant kodali
Hi Akshay, Thanks for this. I will give it a try. The Java API for .explain returns void, and it doesn't throw any checked exception, so I guess I have to catch the generic RuntimeException and walk through the stacktrace to see if there is any ParseException. In short, the code just gets really
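In Scala (rather than the Java API) the parser error can be caught directly instead of walking the stacktrace; a minimal sketch, assuming the Spark 2.x class locations:

    import org.apache.spark.sql.AnalysisException
    import org.apache.spark.sql.catalyst.parser.ParseException

    def isValidSql(sqlText: String): Boolean =
      try {
        spark.sql(sqlText).explain(true)   // parses and analyzes without running the query
        true
      } catch {
        case _: ParseException    => false // invalid syntax
        case _: AnalysisException => false // valid syntax, but unknown table/column, etc.
      }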

Re: [Kubernets] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Chandu Kavar
Thank you for the clarification. On Tue, Mar 5, 2019 at 11:59 PM Gabor Somogyi wrote: > Hi, > > It will be automatically assigned when one creates a PR. > > BR, > G > > > On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar wrote: > >> My Jira username is: *cckavar* >> >> On Tue, Mar 5, 2019 at 11:46

Re: [Kubernets] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Gabor Somogyi
Hi, It will be automatically assigned when one creates a PR. BR, G On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar wrote: > My Jira username is: *cckavar* > > On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar wrote: > >> Hi Team, >> I have created a JIRA ticket to expose 4040 port on driver service.

Re: [Kubernets] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Chandu Kavar
My Jira username is: *cckavar* On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar wrote: > Hi Team, > I have created a JIRA ticket to expose 4040 port on driver service. Also, > added the details in the ticket. > > I am aware of the changes that we need to make and want to assign this > task to my

[Kubernets] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Chandu Kavar
Hi Team, I have created a JIRA ticket to expose 4040 port on driver service. Also, added the details in the ticket. I am aware of the changes that we need to make and want to assign this task to myself, but I am not able to assign it by myself. Can someone help me assign this task to me? Thank

[PySpark] TypeError: expected string or bytes-like object

2019-03-05 Thread Thomas Ryck
I am using PySpark through JupyterLab using the Spark distribution provided with *conda install pyspark*. So I run Spark locally. I started using pyspark 2.4.0 but I had a Socket issue which I solved by downgrading the package to 2.3.2. So I am using pyspark 2.3.2 at the moment. I am trying to

Why does Apache Spark Master shutdown when Zookeeper expires the session

2019-03-05 Thread lokeshkumar
As I understand, Apache Spark Master can be run in high availability mode using Zookeeper. That is, multiple Spark masters can run in Leader/Follower mode and these masters are registered with Zookeeper. In our scenario Zookeeper is expiring the session of the Spark Master which is acting as Leader. So
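As far as I understand, the standalone Master deliberately exits when its ZooKeeper leadership is revoked, expecting a standby master (and a process supervisor that restarts the daemon) to take over. For reference, the standard HA setup is configured through spark-env.sh; the ZooKeeper addresses below are placeholders:

    # spark-env.sh on each master node (addresses are placeholders)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"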

Re: Is there a way to validate the syntax of raw spark sql query?

2019-03-05 Thread Akshay Bhardwaj
Hi Kant, You can try "explaining" the sql query. spark.sql(sqlText).explain(true); //the parameter true is to get more verbose query plan and it is optional. This is the safest way to validate sql without actually executing/creating a df/view in spark. It validates syntax as well as schema of

Re: How to add more imports at the start of REPL

2019-03-05 Thread Nuthan Reddy
Ok, thanks. It does solve the problem. Nuthan Reddy Sigmoid Analytics On Tue, Mar 5, 2019 at 5:17 PM Jesús Vásquez wrote: > Hi Nuthan, I have had the same issue before. > As a shortcut I created a text file called imports and then loaded its > content with the command :load of the scala

Re: How to add more imports at the start of REPL

2019-03-05 Thread Jesús Vásquez
Hi Nuthan, I have had the same issue before. As a shortcut I created a text file called imports and then loaded its content with the :load command of the Scala REPL. Example: create an "imports" text file with the import instructions you need: import org.apache.hadoop.conf.Configuration import
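A concrete version of that shortcut (the file path and the extra imports are just examples):

    // contents of /tmp/imports.scala (any plain text file works)
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.functions._

    // then, inside spark-shell:
    scala> :load /tmp/imports.scala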

C++ script on Spark Cluster throws exit status 132

2019-03-05 Thread Mkal
I'm trying to run a C++ program on a Spark cluster by using the rdd.pipe() operation but the executors throw: java.lang.IllegalStateException: Subprocess exited with status 132. The Spark jar runs totally fine standalone and the C++ program runs just fine on its own as well. I've tried with
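Not from the thread, but exit status 132 usually means the child process died with signal 4 (SIGILL, 128 + 4), i.e. an illegal instruction, which often points to a binary built for a newer CPU than the executor machines. For reference, a minimal pipe() call (the binary path is a placeholder):

    // The binary must be present at the same path on every executor,
    // or be shipped to them (e.g. via --files) and referenced accordingly.
    val input = sc.parallelize(Seq("line1", "line2"))
    val piped = input.pipe("/opt/tools/my_cpp_binary")   // placeholder path
    piped.collect().foreach(println)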

How to add more imports at the start of REPL

2019-03-05 Thread Nuthan Reddy
Hi, When launching the REPL using spark-submit, the following are loaded automatically.
scala> :imports
 1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
 2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
 3) import spark.sql (1 terms)

[no subject]

2019-03-05 Thread Shyam P
Hi All, I need to save a huge data frame as a parquet file. As it is huge, it is taking several hours. To improve performance it is known I have to write it group-wise. But when I do partition(columns*)/groupBy(columns*), the driver is spilling a lot of data and performance suffers a lot again. So how