[Spark SQL] issue about difference in memory size between DataFrame and RDD

2020-04-19 Thread Lyx
Hello, I've been using Spark for my project these days; however, I noticed that when loading data stored in Hadoop HDFS, there seems to be a huge difference in JVM memory size between using the DataFrame format and using the RDD format. Below is my shell script when using spark-shell, my original

[Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

2020-04-19 Thread maqy1...@outlook.com
Hi all, I get a Dataset[Row] through the following code: val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata") After that I want to collect it to the driver: val df_rows: Array[Row] = df.collect() The Spark web UI shows that all tasks have run

Re: Understanding spark structured streaming checkpointing system

2020-04-19 Thread Ruijing Li
It’s not intermittent; it seems to happen every time Spark fails when it starts up from the last checkpoint and complains that the offset is old. I checked the offset, and it has indeed expired on the Kafka side. My version of Spark is 2.4.4, using Kafka 0.10. On Sun, Apr 19, 2020 at 3:38 PM

Re: [Structured Streaming] Checkpoint file compact file grows big

2020-04-19 Thread Jungtaek Lim
Deleting the latest .compact file would lose the exactly-once guarantee and cause Spark to fail to read from the output directory. If you're reading the output directory from a non-Spark reader, then the metadata in the output directory doesn't matter, but there's no exactly-once (exactly-once is achieved leveraging

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-19 Thread Jungtaek Lim
Did you provide more records to the topic "after" you started the query? That's the only cause I can imagine based on this information. On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote: > Hi all, > > Apologies if this has been asked before, but I could not find the answer > to this question. We have

Re: Spark structured streaming - Fallback to earliest offset

2020-04-19 Thread Jungtaek Lim
You may want to check "where" the job is stuck by taking a thread dump - it could be in the Kafka consumer, in the Spark codebase, etc. Without that information it's hard to say. On Thu, Apr 16, 2020 at 4:22 PM Ruijing Li wrote: > Thanks Jungtaek, that makes sense. > > I tried Burak’s solution of just

Re: Understanding spark structured streaming checkpointing system

2020-04-19 Thread Jungtaek Lim
That sounds odd. Is it intermittent, or always reproducible if you start with the same checkpoint? What's the version of Spark? On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote: > Hi all, > > I have a question on how structured streaming does checkpointing. I’m > noticing that spark is not reading

Re: How to pass a constant value to a partitioned hive table in spark

2020-04-19 Thread Mich Talebzadeh
Many thanks Ayan. I tried that as well as follows: val broadcastValue = "123456789" // I assume this will be sent as a constant for the batch val df = spark.read. format("com.databricks.spark.xml"). option("rootTag", "hierarchy"). option("rowTag",