Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread shyla deshpande
Is it OK to use ProtoBuf for sending messages to Kafka? I do not see anyone using it. Please direct me to some code samples of how to use it in Spark Structured Streaming. Thanks again. On Sat, Nov 12, 2016 at 11:44 PM, shyla deshpande wrote: > Thanks everyone.

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread shyla deshpande
Thanks everyone. Very good discussion. Thanks Jacek, for the code snippet. I downloaded your Mastering Apache Spark pdf. I love it. I have one more question. On Sat, Nov 12, 2016 at 2:21 PM, Sean McKibben wrote: > I think one of the advantages of using akka-streams

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
Hi, I meant the way Wandisco does replication: streaming blocks of data one after another. You are correct that temporary directories need not be replicated. One of their points is that one can replicate a cluster from, say, NY to Singapore. I very much doubt that is doable given the volume of data

unsubscribe

2016-11-12 Thread Bibudh Lahiri
unsubscribe -- Bibudh Lahiri Senior Data Scientist, Impetus Technologies 720 University Avenue, Suite 130 Los Gatos, CA 95129 http://knowthynumbers.blogspot.com/

spark-shell not starting ( in a Kali linux 2 OS)

2016-11-12 Thread Kelum Perera
Dear Users, I'm a newbie trying to get spark-shell working on the Kali Linux OS, but I'm getting the error "spark-shell: command not found". I'm running Kali Linux 2 (64-bit). I followed several tutorials, including: https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
I haven't tried rdd.distinct. Since reduceByKey itself is not helping me even with a sliding window here, I thought rdd.distinct might not help either. I will write minimal code reproducing the issue and share it with you. One other point I want to bring up is that I am

toDebugString is clipped

2016-11-12 Thread Anirudh Perugu
Hello all, I am trying to understand how GraphX works internally. I created a small program in GraphX: 1. I create a new graph: val graph: Graph[(String, Double), Int] = Graph(vertexRDD, edgeRDD) 2. Now I want to see how my vertices were created, hence I use scala>
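
A minimal sketch, assuming the clipping comes from the spark-shell result printer rather than from toDebugString itself, in which case printing explicitly shows the full lineage (the toy vertex/edge data below are placeholders):

  import org.apache.spark.graphx.{Edge, Graph}
  import org.apache.spark.rdd.RDD

  // placeholder vertex and edge RDDs mirroring the example above (sc is the spark-shell context)
  val vertexRDD: RDD[(Long, (String, Double))] =
    sc.parallelize(Seq((1L, ("a", 1.0)), (2L, ("b", 2.0))))
  val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1)))

  val graph: Graph[(String, Double), Int] = Graph(vertexRDD, edgeRDD)

  // println emits the whole debug string instead of the REPL's clipped echo
  println(graph.vertices.toDebugString)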

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread ayan guha
Have you tried rdd.distinct? On Sun, Nov 13, 2016 at 8:28 AM, Cody Koeninger wrote: > Can you come up with a minimal reproducible example? > > Probably unrelated, but why are you doing a union of 3 streams? > > On Sat, Nov 12, 2016 at 10:29 AM, dev loper

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Sean McKibben
I think one of the advantages of using akka-streams within Spark is the fact that it is a general purpose stream processing toolset with backpressure, not necessarily specific to kafka. If things work out with the approach, Spark could be a great benefit to use as a coordination framework for

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
Can you come up with a minimal reproducible example? Probably unrelated, but why are you doing a union of 3 streams? On Sat, Nov 12, 2016 at 10:29 AM, dev loper wrote: > There are no failures or errors. Irrespective of that I am seeing > duplicates. The steps and stages are

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-12 Thread Cody Koeninger
You should not be getting consumer churn on executors at all, that's the whole point of the cache. How many partitions are you trying to process per executor? http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#locationstrategies gives instructions on the default size of
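
A minimal sketch of raising the consumer-cache capacity when an executor handles many topic-partitions, assuming the spark.streaming.kafka.consumer.cache.* settings described on that page (the values are only examples):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kafka-0-10-cache")
    // size the cached-consumer capacity to at least the number of
    // topic-partitions each executor is expected to process
    .set("spark.streaming.kafka.consumer.cache.initialCapacity", "128")
    .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")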

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
Thanks for the links. The difficulty with building DR for HDFS is the distributed nature of HDFS. If each DataNode had a mirror copy in DR via something similar to SRDF (assuming NameNode and others taken care of), then there would not be an issue. The fail-over would be starting the mirror HDFS

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-12 Thread Ivan von Nagy
Hi Sean, Thanks for responding. We have run our jobs with internal parallel processing for well over a year (Spark 1.5, 1.6 and Kafka 0.8.2.2.) and did not encounter any of these issues until upgrading to Spark 2.0.1 and Kafka 0.10 clients. If we process serially, then we sometimes get the

Re: Strongly Connected Components

2016-11-12 Thread Koert Kuipers
oh ok i see now it's not the same On Sat, Nov 12, 2016 at 2:48 PM, Koert Kuipers wrote: > not sure i see the faster algo in the paper you mention. > > i see this in section 6.1.2: > "In what follows we give a simple labeling algorithm that computes > connectivity on sparse

Re: Strongly Connected Components

2016-11-12 Thread Koert Kuipers
not sure i see the faster algo in the paper you mention. i see this in section 6.1.2: "In what follows we give a simple labeling algorithm that computes connectivity on sparse graphs in O(log N) rounds." N here is the size of the graph, not the largest component diameter. that is the exact

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-12 Thread Sean McKibben
How are you iterating through your RDDs in parallel? In the past (Spark 1.5.2) when I've had actions being performed on multiple RDDs concurrently using futures, I've encountered some pretty bad behavior in Spark, especially during job retries. Very difficult to explain things, like records

Re: Possible DR solution

2016-11-12 Thread deepak.subhramanian
Sent from my Samsung Galaxy smartphone. Original message From: Timur Shenkao Date: 12/11/2016 09:17 (GMT-08:00) To: Mich Talebzadeh , user@spark.apache.org Subject: Re: Possible DR solution Hi guys! 1) Though it's quite

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-12 Thread Ivan von Nagy
The code was changed to use a unique group for each KafkaRDD that was created (see Nov 10 post). There is no KafkaRDD being reused. The basic logic (see Nov 10 post for example) is get a list of channels, iterate through them in parallel, load a KafkaRDD using a given topic and a consumer group
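
A minimal sketch of that pattern with the 0-10 KafkaUtils.createRDD API, assuming one KafkaRDD per channel with its own consumer group (topic names, offsets and broker address are placeholders):

  import scala.collection.JavaConverters._
  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

  val channels = Seq("channelA", "channelB")          // placeholder channel/topic list

  channels.par.foreach { channel =>
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // unique consumer group per KafkaRDD, as described above
      "group.id"           -> s"loader-$channel-${java.util.UUID.randomUUID}"
    ).asJava

    val ranges = Array(OffsetRange(channel, 0, 0L, 1000L))   // placeholder offsets
    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams, ranges, LocationStrategies.PreferConsistent)
    rdd.foreach(record => println(record.value))
  }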

Re: Possible DR solution

2016-11-12 Thread Timur Shenkao
Hi guys! 1) Though it's quite interesting, I believe that this discussion is not about Spark :) 2) If you are interested, there is solution by Cloudera https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_bdr_replication_intro.html (requires that *source cluster* has Cloudera

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Jacek Laskowski
Hi Luciano, Mind sharing why to have a structured streaming source/sink for Akka if Kafka's available and Akka Streams has a Kafka module? #curious Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
There are no failures or errors. Irrespective of that I am seeing duplicates. The steps and stages are all successful and even speculation is turned off. On Sat, Nov 12, 2016 at 9:55 PM, Cody Koeninger wrote: > Are you certain you aren't getting any failed tasks or

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
Are you certain you aren't getting any failed tasks or other errors? Output actions like foreach aren't exactly once and will be retried on failures. On Nov 12, 2016 06:36, "dev loper" wrote: > Dear fellow Spark Users, > > My Spark Streaming application (Spark 2.0 , on AWS

Re: Joining to a large, pre-sorted file

2016-11-12 Thread Silvio Fiorito
Hi Stuart, You don’t need the sortBy or sortWithinPartitions. https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/594325853373464/879901972425732/6861830365114179/latest.html This is what the job should look like:

spark streaming with kinesis

2016-11-12 Thread Shushant Arora
Hi, Is spark.streaming.blockInterval for the Kinesis input stream hardcoded to 1 sec, or is it configurable? This is the time interval at which the receiver fetches data from Kinesis, meaning the stream batch interval cannot be less than spark.streaming.blockInterval, and this should be configurable. Also is
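
For reference, spark.streaming.blockInterval is set like any other Spark property, so a minimal sketch looks like the following (the 500ms value is only an example; whether the Kinesis receiver honours it is exactly the question above):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("kinesis-block-interval")
    // interval at which received data is chunked into blocks before storage
    .set("spark.streaming.blockInterval", "500ms")

  val ssc = new StreamingContext(conf, Seconds(5))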

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Luciano Resende
If you are interested in Akka streaming, it is being maintained in Apache Bahir. For Akka there isn't a structured streaming version yet, but we would be interested in collaborating in the structured streaming version for sure. On Thu, Nov 10, 2016 at 8:46 AM shyla deshpande

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
Thanks Jorn. The way WanDisco promotes itself is doing block-level replication. As I understand it, you modify core-file.xml and add a couple of network server locations there. They call this tool Fusion. There are at least 2 Fusion servers for high availability. Each one, among other things, has a

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Jacek Laskowski
Hi, Just to add to Cody's answer...the following snippet works fine on master: spark.readStream .format("kafka") .option("subscribe", "topic") .option("kafka.bootstrap.servers", "localhost:9092") .load .writeStream .format("console") .start Don't forget to add spark-sql-kafka-0-10
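
A slightly expanded sketch of that snippet, assuming the spark-sql-kafka-0-10 artifact is on the classpath (topic and broker are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("kafka-to-console").getOrCreate()

  val query = spark.readStream
    .format("kafka")
    .option("subscribe", "topic")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .load
    .writeStream
    .format("console")
    .start

  // keep the streaming query running
  query.awaitTermination()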

Re: Spark joins using row id

2016-11-12 Thread Rohit Verma
Result of explain is as follows:

*BroadcastHashJoin [rowN#0], [rowN#39], LeftOuter, BuildRight
:- *Project [rowN#0, informer_code#22]
:  +- Window [rownumber() windowspecdefinition(informer_code#22 ASC, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rowN#0], [informer_code#22 ASC]
: +-

Re: Joining to a large, pre-sorted file

2016-11-12 Thread Stuart White
Thanks for the reply. I understand that I need to use bucketBy() to write my master file, but I still can't seem to make it work as expected. Here's a code example for how I'm writing my master file: Range(0, 100) .map(i => (i, s"master_$i")) .toDF("key", "value") .write
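
A minimal sketch of how that write might continue with bucketBy (bucket count and table name are placeholders; bucketed output has to go through saveAsTable rather than a plain path):

  // assumes a spark-shell session where spark and its implicits are in scope
  import spark.implicits._

  Range(0, 100)
    .map(i => (i, s"master_$i"))
    .toDF("key", "value")
    .write
    .bucketBy(16, "key")       // placeholder bucket count
    .sortBy("key")
    .saveAsTable("master")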

Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
Dear fellow Spark Users, My Spark Streaming application (Spark 2.0, on an AWS EMR YARN cluster) listens to campaigns based on live stock feeds and the batch duration is 5 seconds. The application uses Kafka DirectStream and, based on the feed source, there are three streams. As given in the code
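
A minimal sketch of the shape being described, assuming three Kafka direct streams unioned and de-duplicated per batch with reduceByKey (stream construction is omitted and the types are placeholders):

  import org.apache.spark.streaming.dstream.DStream

  // stream1/2/3 stand in for the three KafkaUtils.createDirectStream feeds
  def dedupePerBatch(stream1: DStream[(String, String)],
                     stream2: DStream[(String, String)],
                     stream3: DStream[(String, String)]): DStream[(String, String)] =
    stream1.union(stream2).union(stream3)
      // keep a single value per key within each 5-second batch
      .reduceByKey((first, _) => first)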

Re: pyspark: accept unicode column names in DataFrame.corr and cov

2016-11-12 Thread Hyukjin Kwon
Hi Sam, I think I have some answers for two questions. > Humble request: could we replace the "isinstance(col1, str)" tests with "isinstance(col1, basestring)"? IMHO, yes, I believe this should be basestring. Otherwise, some functions would not accept unicode as arguments for columns in Python

Re: Possible DR solution

2016-11-12 Thread Jörn Franke
What is wrong with good old batch transfer for moving data from one cluster to another? I assume your use case is only business continuity in case of disasters such as data center loss, which are unlikely to happen (well, that does not mean they do not happen) and where you could afford to

Spark joins using row id

2016-11-12 Thread Rohit Verma
For datasets structured as

ds1
rowN  col1
1     A
2     B
3     C
4     C
…

and ds2
rowN  col2
1     X
2     Y
3     Z
…

I want to do a left join: Dataset joined = ds1.join(ds2, "rowN", "left outer"); I somewhere read in SO or this mailing list that if spark is aware of datasets
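
A minimal Scala sketch of that join (the toy rows come from the tables above; "left_outer" is one of the join-type strings the API accepts):

  // assumes a spark-shell session where spark and its implicits are in scope
  import spark.implicits._

  val ds1 = Seq((1, "A"), (2, "B"), (3, "C"), (4, "C")).toDF("rowN", "col1")
  val ds2 = Seq((1, "X"), (2, "Y"), (3, "Z")).toDF("rowN", "col2")

  val joined = ds1.join(ds2, Seq("rowN"), "left_outer")
  joined.explain()   // shows whether a broadcast or sort-merge join was chosen
  joined.show()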

Re: Possible DR solution

2016-11-12 Thread Mich Talebzadeh
Thanks Vince. Can you provide more details on this please? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Exception not failing Python applications (in yarn client mode) - SparkLauncher says app succeeded, where app actually has failed

2016-11-12 Thread ayan guha
It's a known issue. There was one jira which was marked resolved but I still faced it in 1.6. On 12 Nov 2016 14:33, "Elkhan Dadashov" wrote: > Hi, > > *Problem*: > Spark job fails, but RM page says the job succeeded, also > > appHandle = sparkLauncher.startApplication() >

Re: Correct SparkLauncher usage

2016-11-12 Thread Elkhan Dadashov
Hey Mohammad, I implemented the code using CountDownLatch, and SparkLauncher works as expected. Hope it helps. Whenever appHandle.getState() reaches one of the final states, the countDownLatch is decremented and execution returns to the main program. ...final CountDownLatch countDownLatch =
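
A minimal Scala sketch of that pattern, assuming the listener passed to startApplication() is what decrements the latch (jar path, main class and master are placeholders):

  import java.util.concurrent.CountDownLatch
  import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

  val countDownLatch = new CountDownLatch(1)

  val appHandle = new SparkLauncher()
    .setAppResource("/path/to/app.jar")      // placeholder
    .setMainClass("com.example.Main")        // placeholder
    .setMaster("yarn")
    .startApplication(new SparkAppHandle.Listener {
      override def stateChanged(handle: SparkAppHandle): Unit =
        if (handle.getState.isFinal) countDownLatch.countDown()
      override def infoChanged(handle: SparkAppHandle): Unit = ()
    })

  // block the main program until the launched app reaches a final state
  countDownLatch.await()
  println(s"Final state: ${appHandle.getState}")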

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Elkhan Dadashov
In my particular case (to make Spark launching asynchronous), I launch a Hadoop job which consists of only 1 Spark job - which is launched via SparkLauncher#startApplication(). My App --- Launches Map task() --> into Cluster Map

Re: Possible DR solution

2016-11-12 Thread vincent gromakowski
An HDFS tiering policy with good tags should be similar. On 11 Nov 2016 11:19 PM, "Mich Talebzadeh" wrote: > I really don't see why one wants to set up streaming replication unless > for situations where similar functionality to transactional databases is > required

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Sean Owen
Indeed, you get default values if you don't specify concrete values otherwise. Yes, you should see the docs for the version you're using. Note that there are different configs for the new 'unified' memory manager since 1.6, and so some older resources may be correctly explaining the older

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Elkhan Dadashov
@Sean Owen, Thanks for your reply. I put the wrong link to the blog post. Here is the correct link which describes Spark Memory settings on Yarn. I guess they have misused the terms Spark

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Sean Owen
If you're pointing at the 336MB, then it's not really related to any of the items you cite here. This is the memory managed internally by MemoryStore. The blog post refers to the legacy memory manager. You can see a bit of how it works in the code, but this is the sum of the on-heap and off-heap
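
For the unified memory manager (Spark 1.6+), the figure works out roughly as in the sketch below, assuming the Spark 2.0 defaults of 300 MB reserved memory and spark.memory.fraction = 0.6 (the heap figure is only an example; the JVM reports somewhat less than the -Xmx value):

  // rough sketch of the unified memory manager arithmetic (Spark 2.0 defaults)
  def unifiedMemoryMb(usableHeapMb: Long,
                      reservedMb: Long = 300L,
                      memoryFraction: Double = 0.6): Double =
    (usableHeapMb - reservedMb) * memoryFraction

  // e.g. a heap the JVM reports as ~860 MB gives (860 - 300) * 0.6 = 336 MB
  println(unifiedMemoryMb(860))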