Re: PySpark: slicing issue with dataframes

2015-05-17 Thread Davies Liu
Yes, it's a bug, please file a JIRA. On Sun, May 3, 2015 at 10:36 AM, Ali Bajwa ali.ba...@gmail.com wrote: Friendly reminder on this one. Just wanted to get a confirmation that this is not by design before I logged a JIRA. Thanks! Ali On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa

Re: how to set random seed

2015-05-17 Thread Davies Liu
The Python workers used for each stage may be different, so this may not work as expected. You can create a Random object, set the seed, and use it to do the shuffle(): r = random.Random(); r.seed(my_seed); def f(x): r.shuffle(x); rdd.map(f) On Thu, May 14, 2015 at 6:21 AM, Charles Hayden
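A fuller sketch of the idea above, with one assumption added: seeding per partition index so the result is reproducible no matter which Python worker runs each task (`make_partition_shuffler` is a hypothetical helper, not from the original mail).

```python
import random

def make_partition_shuffler(seed):
    # Build a per-partition shuffle function whose output depends only on the
    # seed and the partition index, not on which worker process runs the task.
    def shuffle_partition(index, iterator):
        r = random.Random(seed + index)  # independent, reproducible per partition
        items = list(iterator)
        r.shuffle(items)
        return iter(items)
    return shuffle_partition

# With a live SparkContext this would be applied as:
# shuffled = rdd.mapPartitionsWithIndex(make_partition_shuffler(42))
```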

Re: number of executors

2015-05-17 Thread Akhil Das
Did you try the --executor-cores param? While you submit the job, do a ps aux | grep spark-submit and see the exact command parameters. Thanks Best Regards On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I have a 5-node YARN cluster, I used spark-submit to submit
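For reference, a minimal sketch of the flags in question (all values are placeholders, not taken from the thread):

```shell
# Request executor resources explicitly when submitting to YARN.
spark-submit \
  --master yarn \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 2g \
  my_app.py
```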

Efficient way to fetch all records on a particular node/partition in GraphX

2015-05-17 Thread mas
Hi All, I have distributed my RDD into say 10 nodes. I want to fetch the data that resides on a particular node, say node 5. How can I achieve this? I have tried the mapPartitionsWithIndex function to filter the data of that corresponding node, however it is pretty expensive. Any efficient way to do

Re: number of executors

2015-05-17 Thread xiaohe lan
Sorry, both of them are actually assigned tasks. Aggregated Metrics by Executor: Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input Size / Records | Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill (Disk): 1 | host1:6184 | 1.7 min | 5 | 0 | 5 | 640.0 MB / 12318400 | 382.3 MB / 12100770 | 1630.4

RE: Spark Streaming and reducing latency

2015-05-17 Thread Evo Eftimov
This is the nature of Spark Streaming as a System Architecture: 1. It is a batch processing system architecture (Spark Batch) optimized for Streaming Data 2. In terms of sources of Latency in such System Architecture, bear in mind that besides “batching”, there is also the

Re: Data partitioning and node tracking in Spark-GraphX

2015-05-17 Thread MUHAMMAD AAMIR
Can you please elaborate on the way to fetch the records from a particular partition (node in our case)? For example, my RDD is distributed across 10 nodes and I want to fetch the data of one particular node/partition, i.e. partition/node with index 5. How can I do this? I have tried

Re: Forbidded : Error Code: 403

2015-05-17 Thread Akhil Das
I think you can try this way also: DataFrame df = sqlContext.load(s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro, com.databricks.spark.avro); Thanks Best Regards On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com wrote: Thanks for the suggestion Steve. I'll try that out.

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Akhil Das
Why not just trigger your batch job with that event? If you really need streaming, then you can create a custom receiver and make the receiver sleep till the event has happened. That will obviously run your streaming pipelines without having any data to process. Thanks Best Regards On Fri, May

Re: number of executors

2015-05-17 Thread xiaohe lan
bash-4.1$ ps aux | grep SparkSubmit xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13 /scratch/xilan/jdk1.8.0_45/bin/java -cp

Trying to understand sc.textFile better

2015-05-17 Thread Justin Pihony
All, I am trying to understand the textFile method deeply, but I think my lack of deep Hadoop knowledge is holding me back here. Let me lay out my understanding and maybe you can correct anything that is incorrect. When sc.textFile(path) is called, then defaultMinPartitions is used, which
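Since the question is really about how Hadoop sizes the splits sc.textFile ends up with, here is a simplified sketch of FileInputFormat's arithmetic (the 64 MB block size is an assumption; the real value comes from the HDFS configuration, and this ignores per-file and compression details):

```python
def hadoop_split_size(total_bytes, min_partitions, block_size=64 * 1024 * 1024,
                      min_split_size=1):
    # Hadoop FileInputFormat computes each split as
    # max(minSize, min(goalSize, blockSize)), where goalSize spreads the file
    # evenly over the requested minimum number of partitions.
    goal_size = total_bytes // max(min_partitions, 1)
    return max(min_split_size, min(goal_size, block_size))
```

So for small files the minPartitions hint wins, while for large files the block size caps the split size (producing more partitions than requested).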

Re: textFileStream Question

2015-05-17 Thread Akhil Das
With file timestamp, you can actually see the finding new files logic from here https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172 Thanks Best Regards On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy
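A toy illustration of the logic behind that link (this is not Spark's actual code, just the shape of the check in FileInputDStream.findNewFiles: a file is selected if its modification time falls inside the remembered window and it has not been selected before):

```python
def find_new_files(files, current_time, remember_window, seen):
    # files: list of (name, modification_time) pairs
    # seen: set of file names already processed in earlier batches
    lower = current_time - remember_window
    return [name for name, mtime in files
            if lower < mtime <= current_time and name not in seen]
```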

InferredSchema Example in Spark-SQL

2015-05-17 Thread Rajdeep Dua
Hi All, Was trying the Inferred Schema Spark example http://spark.apache.org/docs/latest/sql-programming-guide.html#overview I am getting the following compilation error on the function toRD() value toRD is not a member of org.apache.spark.rdd.RDD[Person] [error] val people =

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Ram Sriharsha
you are missing sqlContext.implicits._ On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua rajdeep@gmail.com wrote: Here are my imports *import* org.apache.spark.SparkContext *import* org.apache.spark.SparkContext._ *import* org.apache.spark.SparkConf *import*

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Forgot to import the implicit functions/classes? import sqlContext.implicits._ From: Rajdeep Dua [mailto:rajdeep@gmail.com] Sent: Monday, May 18, 2015 8:08 AM To: user@spark.apache.org Subject: InferredSchema Example in Spark-SQL Hi All, Was trying the Inferred Schema Spark example

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Ram Sriharsha
you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring the schema from the case class) On Sun, May 17, 2015 at 5:07 PM, Rajdeep Dua rajdeep@gmail.com wrote: Hi All, Was trying the Inferred Schema Spark example

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Haopu Wang
I want to use a file stream as input. And I looked at the Spark Streaming document again; it says a file stream doesn't need a receiver at all. So I'm wondering if I can control a specific DStream instance. From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent:

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Looks like this problem has been mentioned before: http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2 and a temporary solution is to deploy on a dedicated EMR/S3 configuration. I'll give that one a shot. -- View this message in context:

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Turns out the above thread is unrelated: it was caused by using s3:// instead of s3n://, which I already avoided in my checkpointDir configuration. -- View this message in context:

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Typo? Should be .toDF(), not .toRD() From: Ram Sriharsha [mailto:sriharsha@gmail.com] Sent: Monday, May 18, 2015 8:31 AM To: Rajdeep Dua Cc: user Subject: Re: InferredSchema Example in Spark-SQL you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring schema from the

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Rajdeep Dua
Here are my imports *import* org.apache.spark.SparkContext *import* org.apache.spark.SparkContext._ *import* org.apache.spark.SparkConf *import* org.apache.spark.sql.SQLContext *import* org.apache.spark.sql.SchemaRDD On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua rajdeep@gmail.com wrote:

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Rajdeep Dua
Sorry .. toDF() gives an error [error] /home/ubuntu/work/spark/spark-samples/ml-samples/src/main/scala/sql/InferredSchema.scala:24: value toDF is not a member of org.apache.spark.rdd.RDD[Person] [error] val people =

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
BTW: My thread dump of the driver's main thread looks like it is stuck waiting for Amazon S3 bucket metadata for a long time (which may suggest that I should move the checkpointing directory from S3 to HDFS): Thread 1: main (RUNNABLE) java.net.SocketInputStream.socketRead0(Native Method)

Re: InferredSchema Example in Spark-SQL

2015-05-17 Thread Simon Elliston Ball
You mean toDF() not toRD(). It stands for data frame, if that makes it easier to remember. Simon On 18 May 2015, at 01:07, Rajdeep Dua rajdeep@gmail.com wrote: Hi All, Was trying the Inferred Schema Spark example http://spark.apache.org/docs/latest/sql-programming-guide.html#overview

Re: Spark Streaming and reducing latency

2015-05-17 Thread Akhil Das
With receiver-based streaming, you can actually specify spark.streaming.blockInterval, which is the interval at which the receiver will fetch data from the source. Default value is 200ms and hence if your batch duration is 1 second, it will produce 5 blocks of data. And yes, with Spark Streaming
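A back-of-envelope check of those numbers (each block becomes one task of the batch's first stage, so this also determines per-receiver parallelism):

```python
# Relationship between batch duration and spark.streaming.blockInterval.
batch_interval_ms = 1000  # streaming batch duration (1 second)
block_interval_ms = 200   # spark.streaming.blockInterval default
blocks_per_batch = batch_interval_ms // block_interval_ms
print(blocks_per_batch)  # 5 blocks, i.e. 5 tasks per receiver per batch
```

Lowering the block interval (e.g. to 100ms) yields more, smaller blocks per batch, which can reduce per-record latency at the cost of more tasks.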

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Evo Eftimov
You can make ANY standard receiver sleep by implementing a custom Message Deserializer class with sleep method inside it. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Sunday, May 17, 2015 4:29 PM To: Haopu Wang Cc: user Subject: Re: [SparkStreaming] Is it possible to delay the

Re: Efficient way to fetch all records on a particular node/partition in GraphX

2015-05-17 Thread Ankur Dave
If you know the partition IDs, you can launch a job that runs tasks on only those partitions by calling sc.runJob https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686. For example, we do this in IndexedRDD
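A minimal sketch of that suggestion in PySpark (collect_partition is a hypothetical wrapper, untested against a live cluster; it assumes PySpark's SparkContext.runJob(rdd, partitionFunc, partitions=...) signature):

```python
def collect_partition(sc, rdd, partition_id):
    # runJob launches tasks only for the listed partition ids, so nothing is
    # computed or shipped for the other partitions -- much cheaper than
    # filtering with mapPartitionsWithIndex, which still touches every one.
    return sc.runJob(rdd, lambda iterator: list(iterator),
                     partitions=[partition_id])

# Usage with a live SparkContext:
# node5_records = collect_partition(sc, my_rdd, 5)
```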

Re: textFileStream Question

2015-05-17 Thread Vadim Bichutskiy
This is cool. Thanks Akhil. On Sun, May 17, 2015 at 11:25 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With file timestamp, you can actually see the finding new files logic from here

Big Data Day LA: FREE Big Data Conference in Los Angeles on June 27, 2015

2015-05-17 Thread Slim Baltagi
Please register for the 3rd annual full-day ‘Big Data Day LA’ here: - http://bigdatadayla.org • Location: Los Angeles • Date: June 27, 2015 • Completely FREE: Attendance, Food (Breakfast, Lunch, Coffee Breaks) and Networking Reception • Vendor neutral • Great lineup

Spark Streaming and reducing latency

2015-05-17 Thread dgoldenberg
I keep hearing the argument that the way Discretized Streams work with Spark Streaming is a lot more of a batch processing algorithm than true streaming. For streaming, one would expect a new item, e.g. in a Kafka topic, to be available to the streaming consumer immediately. With the discretized