[Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-24 Thread Jerry Lam
Hi Spark users and developers, has anyone encountered an issue where a Spark SQL job that produces a lot of files (over 1 million) hangs on the refresh method? I'm using Spark 1.5.1. Below is the stack trace. I saw the Parquet files are produced but the driver is doing something very intensive…

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Andy Dang
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver, then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would ass…
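The semaphore idea above can be sketched without Spark at all, since the ordering is enforced purely between driver threads. The following is a minimal, hypothetical JVM sketch (class and method names are mine, not from the thread): each thread stands in for one of the two Spark actions, and a `Semaphore` with zero initial permits forces the second action to wait for the first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Semaphore;

public class OrderedActions {
    // Hypothetical sketch (not Spark API): two driver-side threads, each standing
    // in for a Spark action, with a Semaphore enforcing that "A" finishes first.
    public static List<String> run() {
        Semaphore firstDone = new Semaphore(0);   // zero permits: B must wait
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Thread a = new Thread(() -> {
            log.add("A");                          // stand-in for the foreach action (1)
            firstDone.release();                   // signal that A has completed
        });
        Thread b = new Thread(() -> {
            try {
                firstDone.acquire();               // block until A releases its permit
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            log.add("B");                          // stand-in for the reduce action (2)
        });
        b.start();                                 // start B first: ordering still holds
        a.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }
}
```

In a real Spark driver the two `Runnable` bodies would submit the actions themselves; the semaphore pattern is unchanged.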

[SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Nipun Arora
I wanted to understand something about the internals of Spark Streaming execution. If I have a stream X, and in my program I send stream X to function A and function B: 1. In function A, I do a few transform/filter operations etc. on X -> Y -> Z to create stream Z. Now I do a forEach operation on Z…

spark inner join

2015-10-24 Thread kali.tumm...@gmail.com
Hi All, in SQL, say for example I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like: select moviename, movieid, count(1) from table2 inner join table1 on table1.movieid = table2.movieid group by …, here in SQL table1 has only one column whereas table2 has tw…
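The join-then-aggregate the question describes can be illustrated in plain Java, without Spark. This is a hypothetical in-memory stand-in (table contents and names are invented for illustration): `table1` is the one-column fact list of movie ids, `table2` maps each id to its name, and the stream pipeline mirrors `SELECT moviename, COUNT(1) ... INNER JOIN ... GROUP BY moviename`.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MovieJoin {
    // Inner join table1 (movieid) against table2 (movieid -> moviename),
    // then group by moviename and count, mirroring the SQL in the question.
    public static Map<String, Long> countsByMovie(List<String> table1,
                                                  Map<String, String> table2) {
        return table1.stream()
                .filter(table2::containsKey)                 // inner join: drop unmatched ids
                .collect(Collectors.groupingBy(
                        table2::get,                         // group by moviename
                        Collectors.counting()));             // count(1)
    }
}
```

Note the inner join silently drops any id in table1 that has no match in table2, exactly as the SQL would.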

java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-10-24 Thread Andy Davidson
Hi, I am using Spark Streaming in Java. One of the problems I have is that I need to save Twitter statuses in JSON format as I receive them. When I run the following code on my local machine it works; however, all the output files are created in the current directory of the driver program. Clearly not a g…
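A common fix (an assumption here, since the message is truncated) is to pass `saveAsTextFiles` a fully-qualified `hdfs://` prefix instead of a relative path, so output lands on HDFS rather than in the driver's working directory. The helper below is a hypothetical sketch that builds and validates such a prefix; the hostname and port are placeholders for the cluster's `fs.defaultFS`.

```java
import java.net.URI;

public class HdfsPrefix {
    // Build a fully-qualified output prefix, e.g. for DStream.saveAsTextFiles(prefix, suffix).
    // "namenode" and 8020 below are placeholders for your cluster's namenode address.
    public static String prefix(String host, int port, String dir) {
        URI uri = URI.create(String.format("hdfs://%s:%d%s", host, port, dir));
        if (!"hdfs".equals(uri.getScheme()) || uri.getHost() == null) {
            throw new IllegalArgumentException("not a fully-qualified HDFS URI: " + uri);
        }
        return uri.toString();
    }
}
```

A hypothetical call site would then look like `dstream.saveAsTextFiles(HdfsPrefix.prefix("namenode", 8020, "/user/spark/tweets"), "json")`, producing one directory per batch interval under that HDFS path.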

Re: question about HadoopFsRelation

2015-10-24 Thread Ted Yu
The code below was introduced by SPARK-7673 / PR #6225. See item #1 in the description of the PR. Cheers. On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers wrote: > the code that seems to flatMap directories to all the files inside is in > the private HadoopFsRelation.buildScan: > > // First a…

Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-24 Thread Dibyendu Bhattacharya
Hi, I have raised a JIRA (https://issues.apache.org/jira/browse/SPARK-11045) to track the discussion, but am also mailing the user group. This Kafka consumer has been around for a while in spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) and I see many have started using it. I a…

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-24 Thread Steve Loughran
better wiki entry https://wiki.apache.org/hadoop/BindException

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-24 Thread Steve Loughran
On 24 Oct 2015, at 00:46, Lin Zhao <l...@exabeam.com> wrote: I have Spark on YARN deployed using Cloudera Manager 5.4. The installation went smoothly, but when I try to run spark-shell I get a long list of exceptions saying "failed to bind to: /public_ip_of_host:0" and "Service 'spar…
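Bind failures like this usually mean Spark is trying to bind a listener to an address the host does not actually own (for example a public IP that only exists on a NAT gateway). A hedged sketch of the usual remedies follows — the property names are real Spark 1.x settings, but every hostname/IP value is a placeholder you must replace with one that resolves on all nodes:

```
# spark-defaults.conf — advertise an address the driver actually owns
spark.driver.host    driver-hostname.internal   # placeholder: resolvable from all nodes
spark.driver.port    0                          # 0 = let Spark pick an ephemeral port

# or in spark-env.sh — pin the local interface Spark binds to
SPARK_LOCAL_IP=10.0.0.5                         # placeholder: this host's private IP
```

The wiki page linked earlier in the thread covers the underlying Hadoop-side diagnosis of BindException in more depth.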

Unable to use saveAsSequenceFile

2015-10-24 Thread Amit Singh Hora
Hi All, I am trying to write an RDD as a Sequence file into my Hadoop cluster but am getting connection time-outs again and again. I can ping the Hadoop cluster, and the directory gets created with the file name I specify, so I believe I am missing some configuration. Kindly help me. object WriteSequenceF…

Re: question about HadoopFsRelation

2015-10-24 Thread Koert Kuipers
the code that seems to flatMap directories to all the files inside is in the private HadoopFsRelation.buildScan: // First assumes `input` is a directory path, and tries to get all files contained in it. fileStatusCache.leafDirToChildrenFiles.getOrElse( path, // Otherwise, …