Re: [Spark Structured Streaming] Measure metrics from CsvSink for Rate source

2018-06-21 Thread Dhruv Kumar
Thanks a lot for your mail, Jungtaek. I added the StreamingQueryListener into my code (updated code) and was able to see valid inputRowsPerSecond and processedRowsPerSecond numbers. But it also shows zeros intermittently. Here is the
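For readers following along, a minimal sketch of attaching such a listener (Scala, Spark 2.3 API; the app name is made up):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.StreamingQueryListener
  import org.apache.spark.sql.streaming.StreamingQueryListener._

  val spark = SparkSession.builder.appName("rate-metrics").getOrCreate()

  spark.streams.addListener(new StreamingQueryListener {
    override def onQueryStarted(event: QueryStartedEvent): Unit = ()
    override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    override def onQueryProgress(event: QueryProgressEvent): Unit = {
      val p = event.progress
      // A batch that pulled no new rows reports 0.0 for these rates.
      println(s"batch=${p.batchId} input=${p.inputRowsPerSecond} processed=${p.processedRowsPerSecond}")
    }
  })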

Re: [Spark Structured Streaming] Measure metrics from CsvSink for Rate source

2018-06-21 Thread Jungtaek Lim
I'm referring to 2.4.0-SNAPSHOT (not sure which commit I'm referring to) but it properly returns the input rate.
$ tail -F /tmp/spark-trial-metric/local-1529640063554.driver.spark.streaming.counts.inputRate-total.csv
t,value
1529640073,0.0
1529640083,0.9411272613196695
1529640093,0.9430996541967934
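For context, a CsvSink that writes files like the one above is typically enabled via conf/metrics.properties along these lines (the directory and period shown here are illustrative, not taken from the thread):

  *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
  *.sink.csv.period=10
  *.sink.csv.unit=seconds
  *.sink.csv.directory=/tmp/spark-trial-metric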

[Spark Structured Streaming] Measure metrics from CsvSink for Rate source

2018-06-21 Thread Dhruv Kumar
Hi, I was trying to measure the performance metrics for Spark Structured Streaming, but I am unable to see any data in the metrics log files. My input source is the Rate source
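A minimal sketch of such a setup (Scala; the query name, rate, and trigger interval are illustrative). Note that Structured Streaming only registers its metrics with the configured sinks when spark.sql.streaming.metricsEnabled is set, which is a common reason for empty metrics files:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder.appName("rate-metrics").getOrCreate()
  spark.conf.set("spark.sql.streaming.metricsEnabled", "true")  // required for streaming.* metrics

  val rates = spark.readStream
    .format("rate")
    .option("rowsPerSecond", "10")
    .load()

  val query = rates.writeStream
    .queryName("counts")   // the query name shows up in the metric/file name
    .format("console")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start()

  query.awaitTermination()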

Re: RepartitionByKey Behavior

2018-06-21 Thread Jungtaek Lim
It is not possible because the cardinality of the partitioning key is non-deterministic, while the partition count must be fixed. There's a chance that cardinality > partition count, and then the system can't satisfy the requirement. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Jun 22, 2018 at 8:55 AM,

Re: RepartitionByKey Behavior

2018-06-21 Thread Chawla,Sumit
Based on reading the code, it looks like Spark takes a modulo of the key's hash for partitioning. Keys 'c' and 'b' end up pointing to the same value. What's the best partitioning scheme to deal with this? Regards Sumit Chawla On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit wrote: > Hi > > I have been trying this simple
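If the distinct keys are known (or can be collected) up front, one hedged option is a custom Partitioner that gives every key its own partition. A sketch, assuming a spark-shell session where sc is already defined; the class name and data are illustrative:

  import org.apache.spark.Partitioner

  // One partition per known key; an unseen key would throw, so this is only a sketch.
  class ExactKeyPartitioner(keys: Seq[Any]) extends Partitioner {
    private val index = keys.zipWithIndex.toMap
    override def numPartitions: Int = keys.size
    override def getPartition(key: Any): Int = index(key)
  }

  val rdd = sc.parallelize(Seq(("a", 5), ("d", 8), ("b", 6), ("c", 8)))
  val keys = rdd.keys.distinct().collect().toSeq   // extra pass to learn the keys
  val partitioned = rdd.partitionBy(new ExactKeyPartitioner(keys))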

RepartitionByKey Behavior

2018-06-21 Thread Chawla,Sumit
Hi, I have been trying this simple operation: I want to land all values with one key in the same partition, and not have any different key in that partition. Is this possible? I am seeing 'b' and 'c' always getting mixed up in the same partition. rdd = sc.parallelize([('a', 5), ('d', 8),
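For reference, the collision can be observed directly in spark-shell (Scala shown here; PySpark behaves analogously, since both hash the key modulo the partition count; the data beyond the truncated snippet is illustrative):

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 5), ("d", 8), ("b", 6), ("c", 8)))
  pairs.partitionBy(new HashPartitioner(4))
    .mapPartitionsWithIndex { (i, it) => it.map { case (k, v) => s"partition $i: ($k, $v)" } }
    .collect()
    .foreach(println)
  // Keys whose hashes are equal modulo 4 end up in the same partition.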

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread vaquar khan
Sure, let me check the JIRA. Regards, Vaquar khan On Thu, Jun 21, 2018, 4:42 PM Takeshi Yamamuro wrote: > In this ticket SPARK-24201, the ambiguous statement in the doc had been > pointed out. > can you make pr for that? > > On Fri, Jun 22, 2018 at 6:17 AM, vaquar khan > wrote: > >>

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Takeshi Yamamuro
In this ticket, SPARK-24201, the ambiguous statement in the doc has been pointed out. Can you make a PR for that? On Fri, Jun 22, 2018 at 6:17 AM, vaquar khan wrote: > https://spark.apache.org/docs/2.3.0/ > > Avoid confusion we need to updated doc with supported java version "*Java8 > + " *word

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread vaquar khan
https://spark.apache.org/docs/2.3.0/ To avoid confusion, we need to update the doc with the supported Java versions; the wording "Java 8+" is confusing for users. Currently it says: Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.0 uses Scala 2.11. You will need to use a compatible Scala version

Re: Spark 2.3.0 and Custom Sink

2018-06-21 Thread Lalwani, Jayesh
Actually, you can do partition-level ingest using ForEachWriter. You just have to add each row to a list in the process method, and write to the data store in the close method. I know it's awkward. I don't know why Spark doesn't provide a ForEachPartitionWriter. From: Yogesh Mahajan Date:
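A rough sketch of that buffering pattern (Scala; the bulk-write call is a placeholder for whatever batch API the target store provides):

  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.sql.{ForeachWriter, Row}

  class BufferingWriter extends ForeachWriter[Row] {
    private val buffer = ArrayBuffer.empty[Row]

    override def open(partitionId: Long, version: Long): Boolean = {
      buffer.clear()
      true // accept this partition/epoch
    }

    override def process(row: Row): Unit = buffer += row  // accumulate instead of writing per row

    override def close(errorOrNull: Throwable): Unit = {
      if (errorOrNull == null && buffer.nonEmpty) {
        // bulkWrite(buffer)  // placeholder: one bulk request per partition per trigger
      }
      buffer.clear()
    }
  }

It would be wired in via df.writeStream.foreach(new BufferingWriter).start().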

Re: Spark 2.3.0 and Custom Sink

2018-06-21 Thread Yogesh Mahajan
Since ForeachWriter works at the record level, you cannot do bulk ingest into KairosDB, which supports bulk inserts; this will be slow. Instead, you can have your own Sink implementation, which works at the batch (DataFrame) level. Thanks, http://www.snappydata.io/blog On Thu, Jun
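A hedged sketch of that approach using the Sink trait (note this is an internal API in 2.3, org.apache.spark.sql.execution.streaming.Sink, and exposing it through writeStream.format(...) additionally requires a StreamSinkProvider); the bulkWrite function is an assumed stand-in for the KairosDB bulk-insert call:

  import org.apache.spark.sql.{DataFrame, Row}
  import org.apache.spark.sql.execution.streaming.Sink

  class KairosSink(bulkWrite: Seq[Row] => Unit) extends Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit = {
      // Called once per micro-batch with the new rows; push them in a single bulk request.
      bulkWrite(data.collect().toSeq)
    }
  }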

Spark 2.3.0 and Custom Sink

2018-06-21 Thread subramgr
Hi Spark mailing list, We are looking to push the output of a Structured Streaming query to KairosDB (a time-series database). What would be the recommended way of doing this? Do we implement the *Sink* trait or do we use *ForEachWriter*? At each trigger point, if I do a

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread chriswakare
Hi Rahul, This will work only with Java 8. Installation does not work with either version 9 or 10. Thanks, Christopher -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Rahul Agrawal
Thanks Felix. Do you know if you support Java 9? Thanks, Rahul On Thu, Jun 21, 2018 at 8:11 PM, Felix Cheung wrote: > I’m not sure we have completed support for Java 10 > > -- > *From:* Rahul Agrawal > *Sent:* Thursday, June 21, 2018 7:22:42 AM > *To:*

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Felix Cheung
I'm not sure we have completed support for Java 10 From: Rahul Agrawal Sent: Thursday, June 21, 2018 7:22:42 AM To: user@spark.apache.org Subject: Spark 2.3.1 not working on Java 10 Dear Team, I have installed Java 10, Scala 2.12.6 and spark 2.3.1 in my

Re: Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-21 Thread kant kodali
From my experience so far, update mode in Structured Streaming is the most useful of the three available output modes. But when it comes to an RDBMS there isn't really an upsert, so if I go with ForEachWriter I wouldn't quite know when to do an insert or an update unless I really tie it with my
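One hedged way around the insert-vs-update decision is to let the database perform the upsert, e.g. PostgreSQL's INSERT ... ON CONFLICT; the table and column names below are made up:

  import java.sql.{Connection, DriverManager, PreparedStatement}
  import org.apache.spark.sql.{ForeachWriter, Row}

  class UpsertWriter(url: String, user: String, pass: String) extends ForeachWriter[Row] {
    private var conn: Connection = _
    private var stmt: PreparedStatement = _

    override def open(partitionId: Long, version: Long): Boolean = {
      conn = DriverManager.getConnection(url, user, pass)
      stmt = conn.prepareStatement(
        "INSERT INTO word_counts (word, count) VALUES (?, ?) " +
        "ON CONFLICT (word) DO UPDATE SET count = EXCLUDED.count")
      true
    }

    override def process(row: Row): Unit = {
      stmt.setString(1, row.getString(0))
      stmt.setLong(2, row.getLong(1))
      stmt.executeUpdate()   // the database decides between insert and update
    }

    override def close(errorOrNull: Throwable): Unit = {
      if (stmt != null) stmt.close()
      if (conn != null) conn.close()
    }
  }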

Spark 2.3.1 not working on Java 10

2018-06-21 Thread Rahul Agrawal
Dear Team, I have installed Java 10, Scala 2.12.6 and Spark 2.3.1 on my desktop running Ubuntu 16.04. I am getting an error opening spark-shell: Failed to initialize compiler: object java.lang.Object in compiler mirror not found. Please let me know if there is any way to run Spark on Java 10.

Re: Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-21 Thread Lalwani, Jayesh
Open source Spark Structured Streaming doesn’t have a JDBC sink. You can implement your own ForEachWriter, or you can use my sink from here https://github.com/GaalDornick/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/JdbcSink.scala

createorreplacetempview cause memory leak

2018-06-21 Thread onmstester onmstester
I'm loading some JSON files in a loop, deserializing them into a list of objects, creating a temp table from the list, and running a select on the table (repeating this for every file): for (jsonFile : allJsonFiles) { sqlcontext.sql("select * from mainTable").filter(...).createOrReplaceTempView("table1");
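If the temp views and the plans behind them are what accumulate, one thing worth trying (a sketch, assuming a Spark 2.x SparkSession named spark) is releasing them explicitly at the end of each iteration:

  spark.catalog.dropTempView("table1")   // drop the view created for this file
  spark.catalog.clearCache()             // heavier hammer: drops all cached tables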

restarting ranger kms causes spark thrift server to stop

2018-06-21 Thread quentinlam
Hi, When one of the Ranger KMS instances is stopped, the Spark Thrift Server remains fine, but when the Ranger KMS is started back up, the Spark Thrift Server stops and throws the following error: java.lang.ClassCastException: org.apache.hadoop.security.authentication.client.AuthenticationException cannot

Re: Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-21 Thread Tathagata Das
Actually, we do not support a JDBC sink yet. The blog post was just an example :) I agree it is misleading in hindsight. On Wed, Jun 20, 2018 at 6:09 PM, kant kodali wrote: > Hi All, > > Does Spark Structured Streaming have a JDBC sink or do I need to use > ForEachWriter? I see the following code