Re: question about HadoopFsRelation

2015-10-24 Thread Ted Yu
The code below was introduced by SPARK-7673 / PR #6225. See item #1 in the description of the PR. Cheers. On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers wrote: > the code that seems to flatMap directories to all the files inside is in > the private

Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-24 Thread Dibyendu Bhattacharya
Hi, I have raised a JIRA ( https://issues.apache.org/jira/browse/SPARK-11045 ) to track the discussion, but am also mailing the user group. This Kafka consumer has been around for a while in spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ) and I see many have started using it, I

java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-10-24 Thread Andy Davidson
Hi, I am using Spark Streaming in Java. One of the problems I have is that I need to save Twitter statuses in JSON format as I receive them. When I run the following code on my local machine it works, however all the output files are created in the current directory of the driver program. Clearly not a
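The usual cause of this symptom is passing a bare relative prefix to saveAsTextFiles, which Hadoop resolves against the default filesystem and the driver's working directory; a fully qualified hdfs:// URI pins the output to the cluster instead. A minimal sketch of that distinction using only the Python standard library (the hostname, port, and paths are illustrative, not from the original thread):

```python
from urllib.parse import urlparse

def resolve_scheme(prefix: str) -> str:
    """Return the filesystem scheme a Hadoop-style path would resolve to.
    An empty scheme means the default filesystem, i.e. the driver-local one
    in a plain local setup."""
    scheme = urlparse(prefix).scheme
    return scheme if scheme else "default (driver-local)"

# A bare prefix ends up relative to the driver's working directory:
print(resolve_scheme("tweets/status"))
# -> default (driver-local)

# A fully qualified URI addresses HDFS explicitly:
print(resolve_scheme("hdfs://namenode:8020/user/andy/tweets/status"))
# -> hdfs
```

In Spark itself the same idea applies: pass "hdfs://namenode:8020/user/andy/tweets/status" (or whatever your NameNode address is) as the saveAsTextFiles prefix rather than a relative path.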

[SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Nipun Arora
I wanted to understand something about the internals of Spark Streaming execution. If I have a stream X, and in my program I send stream X to function A and function B: 1. In function A, I do a few transform/filter operations etc. on X->Y->Z to create stream Z. Now I do a forEach operation on Z

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Andy Dang
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver, then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would
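The semaphore trick mentioned above can be sketched without Spark: two worker threads run in parallel, and a semaphore forces the second action to wait for the first, whatever order the threads are started in (the task bodies here are stand-ins for the foreach and reduce jobs, not Spark calls):

```python
import threading

results = []
first_done = threading.Semaphore(0)  # starts at 0: job_b blocks until released

def job_a():
    results.append("A")      # stands in for the foreach job on stream Z
    first_done.release()     # signal that A's work is committed

def job_b():
    first_done.acquire()     # block until A has finished
    results.append("B")      # stands in for the reduce job

# Deliberately start B first to show the ordering still holds:
threads = [threading.Thread(target=job_b), threading.Thread(target=job_a)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
# -> ['A', 'B'] regardless of thread start order
```

The same pattern works in a Spark driver: each thread submits its own action, and the semaphore serializes only the parts that must be ordered.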

spark inner join

2015-10-24 Thread kali.tumm...@gmail.com
Hi All, in SQL, say for example I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like select moviename, movieid, count(1) from table2 inner join table1 on table1.movieid = table2.movieid group by , here in SQL table1 has only one column whereas table2 has

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-24 Thread Steve Loughran
On 24 Oct 2015, at 00:46, Lin Zhao wrote: I have Spark on YARN deployed using Cloudera Manager 5.4. The installation went smoothly, but when I try to run spark-shell I get a long list of exceptions saying "failed to bind to: /public_ip_of_host:0"
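A common cause of this BindException is Spark trying to bind to a public IP that is not actually configured on any of the host's interfaces (typical behind NAT on cloud instances). One hedged workaround is to pin Spark's local address to the private interface address in spark-env.sh (the address below is illustrative):

```shell
# spark-env.sh: bind to the host's private address instead of the public one
export SPARK_LOCAL_IP=10.0.0.12   # the address actually assigned to an interface
```

Whether this applies depends on the network layout; the BindException wiki page linked later in the thread walks through the other causes (port in use, wrong hostname mapping in /etc/hosts).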

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-24 Thread gaurav sharma
I specified spark.cores.max = 4 but it started 2 executors with 2 cores each on each of the 2 workers. In standalone cluster mode, though we can specify worker cores, there is no way to specify the number of cores an executor must take on a particular worker machine. On Sat, Oct 24, 2015 at 1:41

Re: question about HadoopFsRelation

2015-10-24 Thread Koert Kuipers
The code that seems to flatMap directories to all the files inside is in the private HadoopFsRelation.buildScan: // First assumes `input` is a directory path, and tries to get all files contained in it. fileStatusCache.leafDirToChildrenFiles.getOrElse( path, //

Re: Huge shuffle data size

2015-10-24 Thread Sabarish Sasidharan
How many rows are you joining? How many rows in the output? Regards Sab On 24-Oct-2015 2:32 am, "pratik khadloya" wrote: > Actually the groupBy is not taking a lot of time. > The join that I do later takes most (95%) of the time. > Also, the grouping I am doing is

Unable to use saveAsSequenceFile

2015-10-24 Thread Amit Singh Hora
Hi All, I am trying to write an RDD as a Sequence file into my Hadoop cluster but am getting a connection timeout again and again. I can ping the Hadoop cluster, and the directory also gets created with the file name I specify, so I believe I am missing some configuration. Kindly help me. object
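A timeout where the directory still gets created often means the client is addressing the NameNode on the wrong port: writes go through the NameNode RPC port (commonly 8020 or 9000), not the 50070 web UI port. A hedged sketch of the relevant client-side setting (host and port are illustrative and must match your cluster's fs.defaultFS):

```xml
<!-- core-site.xml on the client: point fs.defaultFS at the NameNode RPC port -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
```

The same URI should then be used in the saveAsSequenceFile path, e.g. hdfs://namenode.example.com:8020/user/amit/output.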

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-24 Thread Steve Loughran
better wiki entry https://wiki.apache.org/hadoop/BindException