Yes, it's a bug, please file a JIRA.
On Sun, May 3, 2015 at 10:36 AM, Ali Bajwa ali.ba...@gmail.com wrote:
Friendly reminder on this one. Just wanted to get a confirmation that this
is not by design before I log a JIRA.
Thanks!
Ali
On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa
The Python workers used for each stage may be different, so this may not
work as expected.
You can create a Random object, set the seed on it, and use it to do the shuffle():
import random

r = random.Random()
r.seed(my_seed)  # seed(), not seek(); my_seed is your chosen seed value
def f(x):
    r.shuffle(x)  # shuffle the element (assumed to be a list) in place
    return x
rdd.map(f)
On Thu, May 14, 2015 at 6:21 AM, Charles Hayden
Did you try the --executor-cores param? While you submit the job, do a ps aux |
grep spark-submit and see the exact command-line parameters.
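If you prefer to set it from inside the app, here is a rough sketch of the equivalent config property (assuming YARN or standalone mode; the values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.executor.cores is the config-property counterpart of the
// --executor-cores flag; 4 is a placeholder value.
val conf = new SparkConf()
  .setAppName("executor-cores-example")
  .set("spark.executor.cores", "4")
val sc = new SparkContext(conf)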
Thanks
Best Regards
On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi,
I have a 5-node YARN cluster, and I used spark-submit to submit
Hi All,
I have distributed my RDD across, say, 10 nodes. I want to fetch the data that
resides on a particular node, say node 5. How can I achieve this?
I have tried the mapPartitionsWithIndex function to filter the data of that
corresponding node, however it is pretty expensive.
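For reference, this is roughly what I tried (just a sketch; 5 stands for the partition index of interest):

val partition5 = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 5) iter else Iterator.empty  // keep only the partition we want
}.collect()

Note that this still runs a task on every partition just to keep one, which is why it feels expensive.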
Any efficient way to do
Sorry, they both are actually assigned tasks.
Aggregated Metrics by Executor
Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input Size / Records | Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill (Disk)
1 | host1:6184 | 1.7 min | 5 | 0 | 5 | 640.0 MB / 12318400 | 382.3 MB / 12100770 | 1630.4
This is the nature of Spark Streaming as a System Architecture:
1. It is a batch processing system architecture (Spark Batch) optimized
for Streaming Data
2. In terms of sources of Latency in such System Architecture, bear in
mind that besides “batching”, there is also the
Can you please elaborate on the way to fetch the records from a particular
partition (node in our case)? For example, my RDD is distributed to 10 nodes
and I want to fetch the data of one particular node/partition, i.e. the
partition/node with index 5.
How can I do this?
I have tried
I think you can try this way also:
DataFrame df =
sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro",
"com.databricks.spark.avro");
Thanks
Best Regards
On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com wrote:
Thanks for the suggestion Steve. I'll try that out.
Why not just trigger your batch job with that event?
If you really need streaming, then you can create a custom receiver and
make the receiver sleep till the event has happened. Until then, your streaming
pipeline will obviously run without having any data to process.
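A very rough sketch of such a receiver (eventHasHappened and fetchRecords are hypothetical placeholders for your own logic):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: stays idle until an external event fires, then starts emitting records.
class EventGatedReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    new Thread("event-gated-receiver") {
      override def run(): Unit = {
        while (!isStopped() && !eventHasHappened()) Thread.sleep(1000) // sleep till the event
        while (!isStopped()) fetchRecords().foreach(record => store(record)) // then feed the stream
      }
    }.start()
  }
  override def onStop(): Unit = {}
  private def eventHasHappened(): Boolean = ??? // hypothetical: your event check
  private def fetchRecords(): Seq[String] = ??? // hypothetical: your data source
}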
Thanks
Best Regards
On Fri, May
bash-4.1$ ps aux | grep SparkSubmit
xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13
/scratch/xilan/jdk1.8.0_45/bin/java -cp
All,
I am trying to understand the textFile method deeply, but I think my
lack of deep Hadoop knowledge is holding me back here. Let me lay out my
understanding and maybe you can correct anything that is incorrect.
When sc.textFile(path) is called, defaultMinPartitions is used,
which
With file timestamps, you can actually see the finding-new-files logic
here:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172
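For a concrete picture, here is a minimal sketch of a file-based DStream (assuming an existing SparkContext sc; the directory is a placeholder):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: file streams need no receiver; every batch interval FileInputDStream
// lists the directory and picks up files whose timestamp falls in the current window.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.print()
ssc.start()
ssc.awaitTermination()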
Thanks
Best Regards
On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy
Hi All,
Was trying the Inferred Schema Spark example
http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
I am getting the following compilation error on the function toRD()
value toRD is not a member of org.apache.spark.rdd.RDD[Person]
[error] val people =
you are missing sqlContext.implicits._
On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua rajdeep@gmail.com wrote:
Here are my imports
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import
Forgot to import the implicit functions/classes?
import sqlContext.implicits._
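For context, here is a minimal self-contained version of that example (a sketch assuming Spark 1.3+, an existing SparkContext sc, and the people.txt file shipped with the Spark examples):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // brings toDF() into scope

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")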
From: Rajdeep Dua [mailto:rajdeep@gmail.com]
Sent: Monday, May 18, 2015 8:08 AM
To: user@spark.apache.org
Subject: InferredSchema Example in Spark-SQL
Hi All,
Was trying the Inferred Schema Spark example
you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case
inferring schema from the case class)
On Sun, May 17, 2015 at 5:07 PM, Rajdeep Dua rajdeep@gmail.com wrote:
Hi All,
Was trying the Inferred Schema Spark example
I want to use a file stream as input. And I looked at the Spark Streaming
documentation again; it says a file stream doesn't need a receiver at all.
So I'm wondering if I can control a specific DStream instance.
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent:
Looks like this problem has been mentioned before:
http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2
and a temporary solution is to deploy on a dedicated EMR/S3 configuration.
I'll give that one a shot.
Turns out the above thread is unrelated: it was caused by using s3:// instead
of s3n://, which I had already avoided in my checkpointDir configuration.
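For what it's worth, the checkpoint directory is just a Hadoop-compatible URI, so switching it is a one-line change (a sketch; bucket and path names are placeholders):

// Sketch only: s3n:// rather than s3://, or move off S3 entirely.
ssc.checkpoint("s3n://my-bucket/spark/checkpoints")
// or, to avoid S3 metadata latency in the driver:
ssc.checkpoint("hdfs:///spark/checkpoints")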
Typo? Should be .toDF(), not .toRD()
From: Ram Sriharsha [mailto:sriharsha@gmail.com]
Sent: Monday, May 18, 2015 8:31 AM
To: Rajdeep Dua
Cc: user
Subject: Re: InferredSchema Example in Spark-SQL
you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring
schema from the
Here are my imports
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SchemaRDD
On Sun, May 17, 2015 at 8:05 PM, Rajdeep Dua rajdeep@gmail.com wrote:
Sorry .. toDF() gives an error
[error]
/home/ubuntu/work/spark/spark-samples/ml-samples/src/main/scala/sql/InferredSchema.scala:24:
value toDF is not a member of org.apache.spark.rdd.RDD[Person]
[error] val people =
BTW: my thread dump of the driver's main thread looks like it is stuck
waiting for Amazon S3 bucket metadata for a long time (which may suggest
that I should move the checkpointing directory from S3 to HDFS):
Thread 1: main (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
You mean toDF(), not toRD(). It stands for data frame, if that makes it easier to
remember.
Simon
On 18 May 2015, at 01:07, Rajdeep Dua rajdeep@gmail.com wrote:
Hi All,
Was trying the Inferred Schema Spark example
http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
With receiver-based streaming, you can actually
specify spark.streaming.blockInterval, which is the interval at which the
receiver chunks the data it receives from the source into blocks. The default
value is 200 ms, and hence if your batch duration is 1 second, it will produce
5 blocks of data. And yes, with Spark Streaming
You can make ANY standard receiver sleep by implementing a custom message
deserializer class with a sleep method inside it.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Sunday, May 17, 2015 4:29 PM
To: Haopu Wang
Cc: user
Subject: Re: [SparkStreaming] Is it possible to delay the
If you know the partition IDs, you can launch a job that runs tasks on only
those partitions by calling sc.runJob
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686.
For example, we do this in IndexedRDD
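A minimal sketch of that, assuming the RDD holds Strings and partition 5 is the one of interest (older Spark versions also take an extra allowLocal argument here):

// Sketch only: run tasks on just the listed partitions and collect their contents.
val dataFromPartition5: Array[Array[String]] =
  sc.runJob(rdd, (iter: Iterator[String]) => iter.toArray, Seq(5))
// dataFromPartition5(0) now holds the elements of partition 5 only.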
This is cool. Thanks Akhil.
On Sun, May 17, 2015 at 11:25 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
With file timestamps, you can actually see the finding-new-files logic
from here
Please register for the 3rd annual full day ‘Big Data Day LA’ here: -
http://bigdatadayla.org
• Location: Los Angeles
• Date: June 27, 2015
• Completely FREE: Attendance, Food (Breakfast, Lunch, Coffee Breaks) and
Networking Reception
• Vendor neutral
• Great lineup
I keep hearing the argument that the way Discretized Streams work with Spark
Streaming is a lot more of a batch processing algorithm than true streaming.
For streaming, one would expect a new item, e.g. in a Kafka topic, to be
available to the streaming consumer immediately.
With the discretized