Re: LibSVM should have just one input file

2017-06-11 Thread Yan Facai
Hi, yaphet. It seems that the code you pasted is located in LibSVM rather than SVM. Am I misunderstanding? For LibSVMDataSource: 1. if numFeatures is unspecified, only one file is valid input. val df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") 2. otherwise,

Use SQL Script to Write Spark SQL Jobs

2017-06-11 Thread bo yang
Hi Guys, I am writing a small open source project that uses SQL Script to write Spark jobs. I want to see if other people are interested in using or contributing to this project. The project is called UberScriptQuery (

help with "ERROR server.TransportRequestHandler: Error sending result StreamResponse"

2017-06-11 Thread Steve Sun
I have a Spark job which reads Hive data from S3 and uses that data to generate HFiles. When I read a single ORC file (about 190 MB), the job runs perfectly fine. However, when I try to read the entire directory (about 400 ORC files, roughly 76 GB in total), it keeps throwing: 17/06/12

Re: Read Data From NFS

2017-06-11 Thread ayan guha
I understand how it works with HDFS. My question is: when HDFS is not the file system, how is the number of partitions calculated? Hope that makes it clearer. On Mon, 12 Jun 2017 at 2:42 am, vaquar khan wrote: > > > As per spark doc : > The textFile method also takes an

LibSVM should have just one input file

2017-06-11 Thread darion.yaphet
Hi team: Currently, when using SVM to train on a dataset, we found that the input is limited to a single file. The source code is as follows: val path = if (dataFiles.length == 1) { dataFiles.head.getPath.toUri.toString } else if (dataFiles.isEmpty) { throw new IOException("No input path specified for libsvm

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
Another difference I see is that Spark SQL has logical plans, physical plans, code generation and all those optimizations; I don't see them in Kafka Streaming at this time. On Sun, Jun 11, 2017 at 2:19 PM, kant kodali wrote: > I appreciate the

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
I appreciate the responses; however, I see the other side of the argument, and I actually feel they are now competitors in the streaming space in some sense. Kafka Streaming can indeed do map, reduce, join and window operations, and likewise data can be ingested from many sources in Kafka and send the

RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread Mohammed Guller
Just to elaborate on what Vincent wrote: Kafka streaming provides true record-at-a-time processing capabilities, whereas Spark Streaming provides micro-batching capabilities on top of Spark. Depending on your use case, you may find one better than the other. Both provide stateless and stateful

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vincent gromakowski
I think Kafka Streams is good when the processing of each row is independent of the others (row parsing, data cleaning...). Spark is better when processing groups of rows (group by, ML, window functions...). On 11 June 2017 at 8:15 PM, "yohann jardin" wrote: Hey, Kafka can

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread yohann jardin
Hey, Kafka can also do streaming on its own: https://kafka.apache.org/documentation/streams I don’t know much about it, unfortunately. I can only repeat what I heard at conferences: one should give Kafka streaming a try when one's whole pipeline is using Kafka. I have no pros/cons

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vaquar khan
Hi Kant, Kafka is a message broker used through producers and consumers, and Spark Streaming is used for real-time processing. Kafka and Spark Streaming work together; they are not competitors. Spark Streaming reads data from Kafka and processes it in micro-batches. In easy terms

Re: Read Data From NFS

2017-06-11 Thread vaquar khan
As per the Spark docs: The textFile method also takes an optional second argument for controlling the number of partitions of the file. *By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS)*, but you can also ask for a higher number of partitions
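
The rule quoted above (one partition per block) can be sketched as simple arithmetic. This is only a rough illustration in plain Python, not Spark's actual split logic: the function name `estimate_partitions` is hypothetical, and the 128 MB default is taken from the quoted doc text. Real split computation also depends on the input format, the requested minimum number of partitions, and the block size reported by the file system, which may differ when the file system is not HDFS.

```python
import math

# Block size quoted in the Spark doc excerpt above (HDFS default).
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def estimate_partitions(file_size_bytes, block_size_bytes=DEFAULT_BLOCK_SIZE):
    """Rough estimate: one partition per block of the file.

    Hypothetical helper for illustration only; Spark delegates the real
    split calculation to the underlying Hadoop input format.
    """
    if file_size_bytes <= 0:
        return 0
    # A partially filled last block still needs its own partition.
    return math.ceil(file_size_bytes / block_size_bytes)


if __name__ == "__main__":
    one_hundred_gb = 100 * 1024 ** 3
    print(estimate_partitions(one_hundred_gb))
```

Under these assumptions, passing a different `block_size_bytes` shows how the estimate changes when the file system reports another block size.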

Re: Read Data From NFS

2017-06-11 Thread ayan guha
Hi, my question is: what happens if I have one file of, say, 100 GB? How many partitions will there be? Best, Ayan On Sun, 11 Jun 2017 at 9:36 am, vaquar khan wrote: > Hi Ayan, > > If you have multiple files (example 12 files )and you are using following > code then you

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-11 Thread Jörn Franke
Is Sentry preventing access? > On 11. Jun 2017, at 01:55, vaquar khan wrote: > > Hi , > Please check your firewall security settings; sharing one good link. > > http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1 > > > > Regards,

What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
Hi All, I am trying hard to figure out what the real difference is between Kafka Streaming and Spark Streaming, other than saying that one can be used as part of microservices (since Kafka streaming is just a library) and the other is a standalone framework by itself. If I can accomplish same job one

Re: problem initiating spark context with pyspark

2017-06-11 Thread Gourav Sengupta
Generally I try to make the best of the amount of memory my system has for computation. It might help to see how much memory Windows takes just for running itself and then compare that with Ubuntu or any other Linux, Unix or Solaris system. But I am not quite sure of the use case of