Re: is RocksDB backend available in 3.0 preview?

2020-04-21 Thread Jungtaek Lim
Unfortunately, the short answer is no. Please refer to the last part of the discussion on the PR https://github.com/apache/spark/pull/24922. Unless we get a native implementation of this, I guess this project is the most widely known implementation of a RocksDB backend state store -

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread Yeikel
Hi Zhang. Thank you for your response. While your answer clarifies my confusion about `CollectLimit`, it still does not clarify the recommended way to extract large amounts of data (but not all the records) from a source while maintaining a high level of parallelism. For example, at some

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Jungtaek Lim
No, that's not a thing to apologize for. It's just your call - less context would bring less reaction and interest. On Wed, Apr 22, 2020 at 11:50 AM Ruijing Li wrote: > I apologize, but I cannot share it, even if it is just typical spark > libraries. I definitely understand that limits

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
I apologize, but I cannot share it, even if it is just typical spark libraries. I definitely understand that limits debugging help, but wanted to understand if anyone has encountered a similar issue. On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim wrote: > If there's no third party libraries in

is RocksDB backend available in 3.0 preview?

2020-04-21 Thread kant kodali
Hi All, 1. Is the RocksDB backend available in the 3.0 preview? 2. If RocksDB can store the intermediate results of a stream-stream join, can I run streaming join queries forever? By forever I mean until I run out of disk. Or, to put it another way, can I run the stream-stream join queries for years if necessary

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Jungtaek Lim
If there are no third-party libraries in the dump, then why not share the thread dump? (I mean, the output of jstack.) A stack trace would be more helpful for finding which thread acquired the lock and which other threads are waiting to acquire it, if we suspect a deadlock. On Wed, Apr 22, 2020 at 2:38 AM

Spark Mongodb connector hangs indefinitely, not working on Amazon EMR

2020-04-21 Thread Daniel Stojanov
When running a PySpark application on my local machine I am able to save to and retrieve from the MongoDB server using the MongoDB Spark connector. All works properly. When submitting the exact same application on my Amazon EMR cluster I can see that the package for the Spark driver is being properly

Re: Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0

2020-04-21 Thread Jungtaek Lim
You're hitting an existing issue, https://issues.apache.org/jira/browse/SPARK-17604. While there's no active PR to address it, I've been planning to take a look sooner rather than later. Btw, you may also want to take a look at my previous mail - the topic on the mail thread was regarding file stream

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-21 Thread Ruijing Li
Yes, we did. But for some reason latest does not show them. The count is always 0. On Sun, Apr 19, 2020 at 3:42 PM Jungtaek Lim wrote: > Did you provide more records to topic "after" you started the query? > That's the only one I can imagine based on such information. > > On Fri, Apr 17, 2020

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
In the thread dump, I do see this:
- SparkUI-160-acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | Monitor
- SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock
- SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
After refreshing a couple of times, I notice the lock is being swapped between these 3. The other 2 will be blocked by whoever gets this lock, in a cycle of 160 has lock -> 161 -> 159 -> 160 On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li wrote: > In thread dump, I do see this > - SparkUI-160-
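
The reasoning in this part of the thread (find which thread holds the lock and which threads wait on it) can be sketched as a small waits-for check. This is only an illustration, not something posted in the thread: the row format is assumed from the excerpt quoted in the thread, the sample rows are made up to match the described situation (the blocker of thread 159 is an assumption, since that row is cut off in the archive), and the helper names `waits_for` and `find_cycle` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Assumed row format, modeled on the Spark UI thread dump excerpt in the thread:
#   "<thread name> | <STATE> | Blocked by Thread(Some(<id>)) Lock"
ROW = re.compile(r"SparkUI-(\d+)\b.*?\|\s*(\w+)\s*\|\s*(.*)")
BLOCKER = re.compile(r"Blocked by Thread\(Some\((\d+)\)\)")


def waits_for(rows: List[str]) -> Dict[int, Optional[int]]:
    """Map each thread id to the id of the thread blocking it (None if not blocked)."""
    graph: Dict[int, Optional[int]] = {}
    for row in rows:
        match = ROW.match(row.strip())
        if not match:
            continue  # skip lines that do not look like a thread row
        thread_id = int(match.group(1))
        blocker = BLOCKER.search(match.group(3))
        graph[thread_id] = int(blocker.group(1)) if blocker else None
    return graph


def find_cycle(graph: Dict[int, Optional[int]]) -> Optional[List[int]]:
    """Return one cycle in the waits-for graph, or None if there is none."""
    for start in graph:
        path: List[int] = []
        node: Optional[int] = start
        # Follow "blocked by" edges until we run out or revisit a thread.
        while node is not None and node not in path:
            path.append(node)
            node = graph.get(node)
        if node is not None:
            return path[path.index(node):]
    return None


# Illustrative rows only; the third row's blocker is assumed, not taken from the dump.
SAMPLE_ROWS = [
    "SparkUI-160-acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | Monitor",
    "SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock",
    "SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock",
]

if __name__ == "__main__":
    graph = waits_for(SAMPLE_ROWS)
    print("waits-for:", graph)
    print("cycle:", find_cycle(graph))
```

Under these assumptions, a single snapshot in which one thread is RUNNABLE and the others are blocked by it has no cycle in the waits-for graph, so the snapshot by itself would not show a deadlock among those threads. A cycle within one snapshot (A waits on B, B waits on A) is what would point to a classic deadlock.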

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
Strangely enough I found an old issue that is the exact same issue as mine https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 However I’m using spark 2.4.4 so the issue should have been solved by now. Like the user in the jira issue I am using mesos, but I am reading from

Re: Using P4J Plugins with Spark

2020-04-21 Thread Todd Nist
You may want to make sure you include the jar of P4J and your plugins as part of the following so that both the driver and executors have access. If HDFS is out then you could make a common mount point on each of the executor nodes so they have access to the classes. - spark-submit --jars

Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0

2020-04-21 Thread Pappu Yadav
Hi Team, While running Spark, below are some findings. 1. FileStreamSourceLog is responsible for maintaining the input source file list. 2. Spark Streaming deletes expired log files on the basis of *spark.sql.streaming.fileSource.log.deletion* and

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread ZHANG Wei
https://github.com/apache/spark/pull/7334 may explain the question as below: > This patch preserves this optimization by treating logical Limit operators > specially when they appear as the terminal operator in a query plan: if a > Limit is the final operator, then we will plan a special

Using P4J Plugins with Spark

2020-04-21 Thread Shashanka Balakuntala
Hi users, I'm a bit of a newbie to Spark infrastructure, and I have a small question. I have a Maven project with plugins generated separately in a folder, and the normal java command to run it is as follows: `java -Dp4j.pluginsDir=./plugins -jar /path/to/jar` Now when I run this program in local with