Re: Adding header to an rdd before saving to text file

2017-06-05 Thread Yan Facai
Hi, upendra. It will be easier to use a DataFrame to read/save a CSV file with a header, if you'd like. On Tue, Jun 6, 2017 at 5:15 AM, upendra 1991 wrote: > I am reading a CSV(file has headers header 1st,header2) and generating > rdd, > After few transformations I

Spark on Kubernetes: Birds-of-a-Feather Session 12:50pm 6/6 @ Spark Summit

2017-06-05 Thread Erik Erlandson
Come learn about the community development project to add a native Kubernetes scheduling back-end to Apache Spark! Meet contributors and network with community members interested in running Spark on Kubernetes. Learn how to run Spark jobs on your Kubernetes cluster; find out how to contribute to

Spark Streaming Job Stuck

2017-06-05 Thread Jain, Nishit
I have a very simple Spark Streaming job running locally in standalone mode. There is a custom receiver which reads from a database and passes records to the main job, which prints the total. Not an actual use case but I am playing around to learn. Problem is that the job gets stuck forever, logic is very

Adding header to an rdd before saving to text file

2017-06-05 Thread upendra 1991
I am reading a CSV (file has headers header 1st,header2) and generating an rdd. After a few transformations I create an rdd and finally write it to a txt file. What's the best way to add the header from the source file into the rdd and have it available as a header in the new file, i.e., when I transform the rdd
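One common approach for the RDD route (a minimal sketch with hypothetical names; the Spark calls are shown in comments, and the pure helper below is the core idea):

```scala
// Sketch: prepend a header line to already-formatted CSV lines before
// writing them out. In a Spark job this is roughly:
//   sc.parallelize(Seq(header), 1)
//     .union(dataRdd)          // header partition comes first
//     .coalesce(1)             // optional: single output file
//     .saveAsTextFile(outPath)
// The helper below captures the same logic on a plain collection.
object HeaderDemo {
  def withHeader(header: String, lines: Seq[String]): Seq[String] =
    header +: lines
}
```

For example, `HeaderDemo.withHeader("header1,header2", Seq("a,1", "b,2"))` yields the header followed by the data rows. Note that without `coalesce(1)`, `saveAsTextFile` writes one part file per partition, with the header in its own part file.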

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
My main concern is that the choice of Isilon is not for one use case. It will be a strategic decision for the client and if we decide to go that way we are effectively moving away from HDFS principles (3x replication) etc. as well. Granted one can argue this may be OK but of course we have to look

Edge Node in Spark

2017-06-05 Thread Ashok Kumar
Hi, I am a bit confused between Edge node, Edge server and gateway node in Spark. Do these mean the same thing? How does one set up an Edge node to be used in Spark? Is this different from Edge node for Hadoop please? Thanks

Re: Incorrect CAST to TIMESTAMP in Hive compatibility

2017-06-05 Thread Anton Okolnychyi
Hi, I also noticed this issue. Actually, it was already mentioned several times. There is an existing JIRA (SPARK-17914). I am going to submit a PR to fix this in a few days. Best, Anton On Jun 5, 2017 21:42, "verbamour" wrote: > Greetings, > > I am using Hive

Incorrect CAST to TIMESTAMP in Hive compatibility

2017-06-05 Thread verbamour
Greetings, I am using Hive compatibility in Spark 2.1.1 and it appears that the CAST string to TIMESTAMP improperly trims the sub-second value. In particular, leading zeros in the decimal portion appear to be dropped. Steps to reproduce: 1. From `spark-shell` issue: `spark.sql("SELECT
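The JDBC-style parsing the report compares against can be illustrated on the plain JVM (this is not the Spark code path; it just shows why the leading zero matters):

```scala
import java.sql.Timestamp

// A fractional part of ".012" means 12 ms: the leading zero must be
// preserved. java.sql.Timestamp.valueOf pads the fraction to nanoseconds.
object TsDemo {
  val ts: Timestamp = Timestamp.valueOf("2017-06-05 21:42:00.012")
  // ts.getNanos == 12000000 (12 ms). If leading zeros were trimmed,
  // ".012" would be misread as ".12", i.e. 120000000 ns (120 ms).
}
```

The reply above notes the Spark-side fix is tracked under SPARK-17914.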

Re: SparkAppHandle.Listener.infoChanged behaviour

2017-06-05 Thread Mohammad Tariq
Hi Marcelo, Thank you so much for the response. Appreciate it! Tariq, Mohammad about.me/mti

Fwd: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread anbucheeralan
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch again.

Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread ALunar Beach
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch again.
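With the direct stream, a replayed batch after checkpoint recovery is expected; exactly-once then usually comes from making the output side idempotent. A generic sketch (hypothetical names, not a Spark API) of an idempotent sink keyed by batch time:

```scala
import scala.collection.mutable

// Sketch: remember which batch times have already been written, so a
// batch replayed after checkpoint recovery becomes a no-op at the sink.
class IdempotentSink {
  private val done    = mutable.Set.empty[Long]
  private val written = mutable.ListBuffer.empty[String]

  // Returns true if the batch was actually written, false if skipped.
  def write(batchTimeMs: Long, records: Seq[String]): Boolean =
    if (done.contains(batchTimeMs)) false
    else { done += batchTimeMs; written ++= records; true }

  def output: List[String] = written.toList
}
```

In a real job the "already done" set would live in the external store (e.g. written transactionally with the data), not in driver memory, so it survives restarts.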

Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have the following Kafka Consumer. The consumer is reading from a topic, and I have a mechanism which blocks the consumer from time to time. The producer is a separate thread which is continuously sending data. I want to
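The producer-keeps-sending-while-the-consumer-is-blocked situation can be modeled with a bounded queue (plain JVM sketch, hypothetical names — not the Kafka consumer API, where the broker retains data and the consumer resumes from its last offset):

```scala
import java.util.concurrent.ArrayBlockingQueue

// Sketch: a bounded queue gives natural backpressure. While the consumer
// is blocked, the producer's put() blocks once the queue is full, and
// nothing is lost: the consumer drains everything in order on resume.
object BackpressureDemo {
  def run(n: Int, capacity: Int): List[Int] = {
    val queue = new ArrayBlockingQueue[Integer](capacity)
    val producer = new Thread(new Runnable {
      def run(): Unit = (1 to n).foreach(i => queue.put(i))
    })
    producer.start()
    Thread.sleep(50)                      // consumer "blocked" for a while
    (1 to n).map(_ => queue.take().intValue).toList // resume and drain
  }
}
```

`BackpressureDemo.run(10, 4)` returns the full 1..10 sequence in order even though the producer was briefly stalled at capacity.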

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
Mich, Yes, Isilon is in production... Isilon is a serious product and has been around for quite a while. For on-premise external storage, we see it quite a bit. Separating the compute from the storage actually helps. It is also a nice transition to the cloud providers. Have you looked

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
Hi John, Thanks. Did you end up in production? In other words, besides the PoC, did you use it in anger? The intention is to build Isilon on top of the whole HDFS cluster! If we go that way we also need to adopt it for DR as well. Cheers Dr Mich Talebzadeh LinkedIn *

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
Mich, We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase for real-time). We were concerned initially and the initial setup took a bit longer than expected, but it performed well on both low latency and high throughput use cases at scale (our POC ~ 100 TB). Just a data

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
I am concerned about the use case of tools like Isilon or Panasas to create a layer on top of HDFS, essentially an HCFS on top of HDFS with the usual 3x replication gone into the tool itself. There is interest to push Isilon as the solution forward but my caution is about scalability and future

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
I run a spark-submit (https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications) in client mode that starts the micro-service. If you keep the event loop going then the spark context would remain active. Thanks, Muthu On Mon, Jun 5, 2017 at 2:44 PM, kant kodali
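A minimal sketch of that launch (hypothetical class and jar names — the point is that in client mode the driver, and so the SparkContext, lives inside the service JVM):

```shell
# Launch the micro-service as a client-mode driver on a standalone cluster.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.QueryService \
  query-service.jar
# As long as the service's event loop (e.g. its HTTP server) keeps the
# JVM running, the SparkContext stays alive between queries.
```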

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread kant kodali
Are you launching the SparkSession from a micro-service or through spark-submit? On Sun, Jun 4, 2017 at 11:52 PM, Muthu Jayakumar wrote: > Hello Kant, > > >I still don't understand How SparkSession can use Akka to communicate > with SparkCluster? > Let me use your initial

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
Hello Kant, >I still don't understand How SparkSession can use Akka to communicate with SparkCluster? Let me use your initial requirement as a way to illustrate what I mean -- i.e., "I want my Micro service app to be able to query and access data on HDFS". In order to run a query, say a DF query