Re: PySpark: slicing issue with dataframes

2015-05-17 Thread Davies Liu
Yes, it's a bug, please file a JIRA. On Sun, May 3, 2015 at 10:36 AM, Ali Bajwa ali.ba...@gmail.com wrote: Friendly reminder on this one. Just wanted to get a confirmation that this is not by design before I logged a JIRA. Thanks! Ali On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa

Re: how to set random seed

2015-05-17 Thread Davies Liu
The Python workers used for each stage may be different, so this may not work as expected. You can create a Random object, set the seed, and use it to do the shuffle(): r = random.Random(); r.seed(my_seed); def f(x): r.shuffle(x); rdd.map(f) On Thu, May 14, 2015 at 6:21 AM, Charles Hayden
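A fuller sketch of the idea above, with one assumption added: seeding per partition index so the result is reproducible no matter which Python worker runs each task (`make_partition_shuffler` is a hypothetical helper, not from the original mail).

```python
import random

def make_partition_shuffler(seed):
    # Build a per-partition shuffle function whose output depends only on the
    # seed and the partition index, not on which worker process runs the task.
    def shuffle_partition(index, iterator):
        r = random.Random(seed + index)  # independent, reproducible per partition
        items = list(iterator)
        r.shuffle(items)
        return iter(items)
    return shuffle_partition

# With a live SparkContext this would be applied as:
# shuffled = rdd.mapPartitionsWithIndex(make_partition_shuffler(42))
```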

Re: number of executors

2015-05-17 Thread Akhil Das
Did you try the --executor-cores param? While you submit the job, do a ps aux | grep spark-submit and see the exact command parameters. Thanks Best Regards On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I have a 5-node YARN cluster, I used spark-submit to submit
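For reference, a minimal sketch of the flags in question (all values are placeholders, not taken from the thread):

```shell
# Request executor resources explicitly when submitting to YARN.
spark-submit \
  --master yarn \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 2g \
  my_app.py
```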

Efficient way to fetch all records on a particular node/partition in GraphX

2015-05-17 Thread mas
Hi All, I have distributed my RDD into say 10 nodes. I want to fetch the data that resides on a particular node, say node 5. How can I achieve this? I have tried the mapPartitionsWithIndex function to filter the data of that corresponding node, however it is pretty expensive. Any efficient way to do

Re: number of executors

2015-05-17 Thread xiaohe lan
Sorry, both of them are actually assigned tasks. Aggregated Metrics by Executor: Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input Size / Records | Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill (Disk): 1 | host1:6184 | 1.7 min | 5 | 0 | 5 | 640.0 MB / 12318400 | 382.3 MB / 12100770 | 1630.4

RE: Spark Streaming and reducing latency

2015-05-17 Thread Evo Eftimov
This is the nature of Spark Streaming as a System Architecture: 1. It is a batch processing system architecture (Spark Batch) optimized for Streaming Data 2. In terms of sources of Latency in such System Architecture, bear in mind that besides “batching”, there is also the

Re: Data partitioning and node tracking in Spark-GraphX

2015-05-17 Thread MUHAMMAD AAMIR
Can you please elaborate on the way to fetch the records from a particular partition (node in our case)? For example, my RDD is distributed across 10 nodes and I want to fetch the data of one particular node/partition, i.e. partition/node with index 5. How can I do this? I have tried

Re: Forbidded : Error Code: 403

2015-05-17 Thread Akhil Das
I think you can try this way also: DataFrame df = sqlContext.load(s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro, com.databricks.spark.avro); Thanks Best Regards On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com wrote: Thanks for the suggestion Steve. I'll try that out.

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Akhil Das
Why not just trigger your batch job with that event? If you really need streaming, then you can create a custom receiver and make the receiver sleep till the event has happened. That will obviously run your streaming pipelines without having any data to process. Thanks Best Regards On Fri, May

Re: number of executors

2015-05-17 Thread xiaohe lan
bash-4.1$ ps aux | grep SparkSubmit xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13 /scratch/xilan/jdk1.8.0_45/bin/java -cp

Trying to understand sc.textFile better

2015-05-17 Thread Justin Pihony
All, I am trying to understand the textFile method deeply, but I think my lack of deep Hadoop knowledge is holding me back here. Let me lay out my understanding and maybe you can correct anything that is incorrect. When sc.textFile(path) is called, then defaultMinPartitions is used, which
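Since the question is really about how Hadoop sizes the splits sc.textFile ends up with, here is a simplified sketch of FileInputFormat's arithmetic (the 64 MB block size is an assumption; the real value comes from the HDFS configuration, and this ignores per-file and compression details):

```python
def hadoop_split_size(total_bytes, min_partitions, block_size=64 * 1024 * 1024,
                      min_split_size=1):
    # Hadoop FileInputFormat computes each split as
    # max(minSize, min(goalSize, blockSize)), where goalSize spreads the file
    # evenly over the requested minimum number of partitions.
    goal_size = total_bytes // max(min_partitions, 1)
    return max(min_split_size, min(goal_size, block_size))
```

So for small files the minPartitions hint wins, while for large files the block size caps the split size (producing more partitions than requested).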

Re: textFileStream Question

2015-05-17 Thread Akhil Das
With file timestamp, you can actually see the finding new files logic from here https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172 Thanks Best Regards On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy
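A toy illustration of the logic behind that link (this is not Spark's actual code, just the shape of the check in FileInputDStream.findNewFiles: a file is selected if its modification time falls inside the remembered window and it has not been selected before):

```python
def find_new_files(files, current_time, remember_window, seen):
    # files: list of (name, modification_time) pairs
    # seen: set of file names already processed in earlier batches
    lower = current_time - remember_window
    return [name for name, mtime in files
            if lower < mtime <= current_time and name not in seen]
```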

InferredSchema Example in Spark-SQL

2015-05-17 Thread Rajdeep Dua
Hi All, Was trying the Inferred Schema Spark example http://spark.apache.org/docs/latest/sql-programming-guide.html#overview I am getting the following compilation error on the function toRD() value toRD is not a member of org.apache.spark.rdd.RDD[Person] [error] val people =

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Ram Sriharsha
you are missing sqlContext.implicits._ On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua rajdeep@gmail.com wrote: Here are my imports *import* org.apache.spark.SparkContext *import* org.apache.spark.SparkContext._ *import* org.apache.spark.SparkConf *import*

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Forgot to import the implicit functions/classes? import sqlContext.implicits._ From: Rajdeep Dua [mailto:rajdeep@gmail.com] Sent: Monday, May 18, 2015 8:08 AM To: user@spark.apache.org Subject: InferredSchema Example in Spark-SQL Hi All, Was trying the Inferred Schema Spark example

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Ram Sriharsha
you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring the schema from the case class) On Sun, May 17, 2015 at 5:07 PM, Rajdeep Dua rajdeep@gmail.com wrote: Hi All, Was trying the Inferred Schema Spark example

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Haopu Wang
I want to use a file stream as input. And I looked at the Spark Streaming document again; it says a file stream doesn't need a receiver at all. So I'm wondering if I can control a specific DStream instance. From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent:

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Looks like this problem has been mentioned before: http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2 and a temporary solution is to deploy on a dedicated EMR/S3 configuration. I'll give that one a shot. -- View this message in context:

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Turns out the above thread is unrelated: it was caused by using s3:// instead of s3n://, which I already avoided in my checkpointDir configuration. -- View this message in context:

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Typo? Should be .toDF(), not .toRD() From: Ram Sriharsha [mailto:sriharsha@gmail.com] Sent: Monday, May 18, 2015 8:31 AM To: Rajdeep Dua Cc: user Subject: Re: InferredSchema Example in Spark-SQL you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring schema from the

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Rajdeep Dua
Here are my imports *import* org.apache.spark.SparkContext *import* org.apache.spark.SparkContext._ *import* org.apache.spark.SparkConf *import* org.apache.spark.sql.SQLContext *import* org.apache.spark.sql.SchemaRDD On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua rajdeep@gmail.com wrote:

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Rajdeep Dua
Sorry .. toDF() gives an error [error] /home/ubuntu/work/spark/spark-samples/ml-samples/src/main/scala/sql/InferredSchema.scala:24: value toDF is not a member of org.apache.spark.rdd.RDD[Person] [error] val people =

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
BTW: My thread dump of the driver's main thread looks like it is stuck waiting for Amazon S3 bucket metadata for a long time (which may suggest that I should move the checkpointing directory from S3 to HDFS): Thread 1: main (RUNNABLE) java.net.SocketInputStream.socketRead0(Native Method)

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Simon Elliston Ball
You mean toDF() not toRD(). It stands for data frame, if that makes it easier to remember. Simon On 18 May 2015, at 01:07, Rajdeep Dua rajdeep@gmail.com wrote: Hi All, Was trying the Inferred Schema Spark example http://spark.apache.org/docs/latest/sql-programming-guide.html#overview

Re: Spark Streaming and reducing latency

2015-05-17 Thread Akhil Das
With receiver-based streaming, you can actually specify spark.streaming.blockInterval, which is the interval at which the receiver will fetch data from the source. Default value is 200ms and hence if your batch duration is 1 second, it will produce 5 blocks of data. And yes, with Spark Streaming
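A back-of-envelope check of those numbers (each block becomes one task of the batch's first stage, so this also determines per-receiver parallelism):

```python
# Relationship between batch duration and spark.streaming.blockInterval.
batch_interval_ms = 1000  # streaming batch duration (1 second)
block_interval_ms = 200   # spark.streaming.blockInterval default
blocks_per_batch = batch_interval_ms // block_interval_ms
print(blocks_per_batch)  # 5 blocks, i.e. 5 tasks per receiver per batch
```

Lowering the block interval (e.g. to 100ms) yields more, smaller blocks per batch, which can reduce per-record latency at the cost of more tasks.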

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Evo Eftimov
You can make ANY standard receiver sleep by implementing a custom Message Deserializer class with sleep method inside it. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Sunday, May 17, 2015 4:29 PM To: Haopu Wang Cc: user Subject: Re: [SparkStreaming] Is it possible to delay the

Re: Efficient way to fetch all records on a particular node/partition in GraphX

2015-05-17 Thread Ankur Dave
If you know the partition IDs, you can launch a job that runs tasks on only those partitions by calling sc.runJob https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686. For example, we do this in IndexedRDD
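A minimal sketch of that suggestion in PySpark (collect_partition is a hypothetical wrapper, untested against a live cluster; it assumes PySpark's SparkContext.runJob(rdd, partitionFunc, partitions=...) signature):

```python
def collect_partition(sc, rdd, partition_id):
    # runJob launches tasks only for the listed partition ids, so nothing is
    # computed or shipped for the other partitions -- much cheaper than
    # filtering with mapPartitionsWithIndex, which still touches every one.
    return sc.runJob(rdd, lambda iterator: list(iterator),
                     partitions=[partition_id])

# Usage with a live SparkContext:
# node5_records = collect_partition(sc, my_rdd, 5)
```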

Re: textFileStream Question

2015-05-17 Thread Vadim Bichutskiy
This is cool. Thanks Akhil. On Sun, May 17, 2015 at 11:25 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With file timestamp, you can actually see the finding new files logic from here

Big Data Day LA: FREE Big Data Conference in Los Angeles on June 27, 2015

2015-05-17 Thread Slim Baltagi
Please register for the 3rd annual full-day ‘Big Data Day LA’ here: - http://bigdatadayla.org • Location: Los Angeles • Date: June 27, 2015 • Completely FREE: Attendance, Food (Breakfast, Lunch, Coffee Breaks) and Networking Reception • Vendor neutral • Great lineup

Spark Streaming and reducing latency

2015-05-17 Thread dgoldenberg
I keep hearing the argument that the way Discretized Streams work with Spark Streaming is a lot more of a batch processing algorithm than true streaming. For streaming, one would expect a new item, e.g. in a Kafka topic, to be available to the streaming consumer immediately. With the discretized