Regards,
Shixiong Zhu
2014-03-04 4:23 GMT+08:00 Oleksandr Olgashko alexandrolg...@gmail.com:
Hello. What is the best way to check two Vectors for equality?
val a = new Vector(Array(1))
val b = new Vector(Array(1))
println(a == b)
// false
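For illustration, a minimal workaround sketch: compare the underlying arrays element-wise instead of relying on the Vector's equals (assuming the vectors wrap an Array[Double]):
val x = Array(1.0)
val y = Array(1.0)
println(x.sameElements(y))              // true
println(java.util.Arrays.equals(x, y))  // true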
().take(5)
Best Regards,
Shixiong Zhu
2014-03-09 13:30 GMT+08:00 Kane kane.ist...@gmail.com:
When I try to open a sequence file:
val t2 = sc.sequenceFile("/user/hdfs/e1Mseq", classOf[String],
classOf[String])
t2.groupByKey().take(5)
I get:
org.apache.spark.SparkException: Job aborted: Task 25.0:0
to create an RDD from a collection.
Best Regards,
Shixiong Zhu
2014-03-19 20:52 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com:
Not sure what you mean by not getting information on how to join. If
you mean that you can't see the result, I believe you need to collect
the result of the join
solution is using rdd.partitionBy(new HashPartitioner(1)) to
make sure there is only one partition. But that's not efficient for big
input.
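For illustration, a minimal sketch of that single-partition approach (the pair RDD here is just a placeholder):
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val single = pairs.partitionBy(new HashPartitioner(1))   // everything in one partition
println(single.partitions.length)                        // 1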
Best Regards,
Shixiong Zhu
2014-04-02 11:10 GMT+08:00 Thierry Herrmann thierry.herrm...@gmail.com:
I'm new to Spark, but isn't this a pure Scala question?
You can use JavaPairRDD.saveAsHadoopFile/saveAsNewAPIHadoopFile.
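For illustration, a rough Scala sketch of saveAsNewAPIHadoopFile (the pair RDD, the Text key/value types, and the output path are assumptions):
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
pairs
  .map { case (k, v) => (new Text(k), new Text(v)) }   // convert to Writables
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/example-output",                      // output path (placeholder)
    classOf[Text],                                     // key class
    classOf[Text],                                     // value class
    classOf[TextOutputFormat[Text, Text]])             // output format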
Best Regards,
Shixiong Zhu
2014-06-20 14:22 GMT+08:00 abhiguruvayya sharath.abhis...@gmail.com:
Any inputs on this will be helpful.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How
Best Regards,
Shixiong Zhu
2014-08-14 22:11 GMT+08:00 Christopher Nguyen c...@adatao.com:
Hi Hoai-Thu, the issue of private default constructor is unlikely the
cause here, since Lance was already able to load/deserialize the model
object.
And on that side topic, I wish all serdes libraries
I think in the following case
class Foo { def foo() = Array(1.0) }
val t = new Foo
val m = t.foo
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))
r2.toArray
Spark should not serialize t, but it looks like it will.
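For what it's worth, a common workaround sketch (not necessarily the fix for this particular case): re-bind the value in a local scope so the closure captures only the array, not any enclosing object:
val r2 = {
  val local = m            // local copy; the closure closes over `local` only
  r1.map(_ + local(0))
}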
Best Regards,
Shixiong Zhu
2014-08-14 23:22 GMT+08:00 lancezhange
and these values cannot fit
into memory. Spilling data to disk doesn't help because cogroup needs to
read all values for a key into memory.
Any suggestions for solving these OOM cases? Thank you.
Best Regards,
Shixiong Zhu
to check if anyone has similar problem and
better solution.
Best Regards,
Shixiong Zhu
2014-10-28 0:13 GMT+08:00 Holden Karau hol...@pigscanfly.ca:
On Monday, October 27, 2014, Shixiong Zhu zsxw...@gmail.com wrote:
We encountered some special OOM cases of cogroup when the data in one
Are you using Spark standalone mode? If so, you need to
set spark.io.compression.codec for all workers.
Best Regards,
Shixiong Zhu
2014-10-28 10:37 GMT+08:00 buring qyqb...@gmail.com:
Here is the error log; I've summarized it as follows:
INFO [binaryTest---main]: before first
WARN
I mean updating the spark conf not only in the driver, but also in the
Spark Workers.
Because the driver configurations cannot be read by the Executors, they
still use the default spark.io.compression.codec to deserialize the tasks.
Best Regards,
Shixiong Zhu
2014-10-28 16:39 GMT+08:00 buring
Or def getAs[T](i: Int): T
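For example, a small sketch of both accessors (assuming `row` is a Row whose column 0 holds a String):
val name1 = row(0).asInstanceOf[String]
val name2 = row.getAs[String](0)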
Best Regards,
Shixiong Zhu
2014-10-29 13:16 GMT+08:00 Zhan Zhang zzh...@hortonworks.com:
Can you use row(i).asInstanceOf[]
Thanks.
Zhan Zhang
On Oct 28, 2014, at 5:03 PM, Mohammed Guller moham...@glassbeam.com
wrote:
Hi –
The Spark SQL Row class has
is not persisted, Spark needs to
load the data again. You can call RDD.cache to persist the RDD in
memory.
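For example, a minimal sketch (`rdd` is a placeholder):
val cached = rdd.cache()   // same as persist(StorageLevel.MEMORY_ONLY)
cached.count()             // the first action loads and caches the data
cached.count()             // later actions read from memory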
Best Regards,
Shixiong Zhu
2014-11-06 11:35 GMT+08:00 nsareen nsar...@gmail.com:
I noticed a behaviour where, if I'm using
val temp = sc.parallelize ( 1 to 10
Two limitations we found here:
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-in-quot-cogroup-quot-td17349.html
Best Regards,
Shixiong Zhu
2014-11-06 2:04 GMT+08:00 Yangcheng Huang yangcheng.hu...@huawei.com:
Hi
One question about the power of spark.shuffle.spill –
(I
.
Best Regards,
Shixiong Zhu
2014-11-06 7:56 GMT+08:00 ankits ankitso...@gmail.com:
In my spark job, I have a loop something like this:
bla.foreachRDD(rdd => {
  // init some vars
  rdd.foreachPartition(partition => {
    // init some vars
    partition.foreach(kv => {
      ...
I am seeing
Will this work even with Kryo serialization?
Currently spark.closure.serializer must be
org.apache.spark.serializer.JavaSerializer, so the serialization of
closure functions won't involve Kryo. Kryo is only used to
serialize the data.
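For illustration, a sketch of enabling Kryo for data serialization (the app name is a placeholder):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")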
Best Regards,
Shixiong Zhu
2014-11-07 12:27 GMT+08
it? Is there
a SparkContext field in the outer class?
Best Regards,
Shixiong Zhu
2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:
I am also using Spark 1.1.0 and I ran it on a cluster of nodes (it works
if I
run it in local mode!)
If I put the accumulator inside the for loop, everything
Such queries are not supported yet. I can easily reproduce it. Created a
JIRA here: https://issues.apache.org/jira/browse/SPARK-4296
Best Regards,
Shixiong Zhu
2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com:
I am trying to group by on a calculated field. Is it supported
Could you provide the code of hbaseQuery? Maybe it doesn't support
executing in parallel.
Best Regards,
Shixiong Zhu
2014-11-12 14:32 GMT+08:00 qiaou qiaou8...@gmail.com:
Hi:
I got a problem with using the union method of RDD
things like this
I get a function like
def hbaseQuery
get the same value of the broadcast variable
(e.g. if the variable is shipped to a new node later).
Best Regards,
Shixiong Zhu
2014-11-12 15:20 GMT+08:00 qiaou qiaou8...@gmail.com:
This works!
But can you explain why it should be used like this?
--
qiaou
Sent with Sparrow http://www.sparrowmailapp.com
to create some
big enough tasks. Of course, you can reduce `spark.locality.wait`, but it
may not be efficient because it still creates many tiny tasks.
Best Regards,
Shixiong Zhu
2014-11-22 17:17 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com:
What is your cluster setup? Are you running a worker
: scala.math.BigInt = 100
Best Regards,
Shixiong Zhu
2014-11-25 10:31 GMT+08:00 Peter Thai thai.pe...@gmail.com:
Hello!
Does anyone know why I may be receiving negative final accumulator values?
Thanks!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com
4096MB is greater than Int.MaxValue and it will overflow in Spark.
Please set it to less than 4096.
Best Regards,
Shixiong Zhu
2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com:
I meet the same problem, did you solve it ?
--
View this message in context:
http://apache-spark-user-list
Sorry, it should be less than 2048; 2047 is the largest value.
Best Regards,
Shixiong Zhu
2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:
4096MB is greater than Int.MaxValue and it will overflow in Spark.
Please set it to less than 4096.
Best Regards,
Shixiong Zhu
2014-12
Created a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-4664
Best Regards,
Shixiong Zhu
2014-12-01 13:22 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:
Sorry, it should be less than 2048; 2047 is the largest value.
Best Regards,
Shixiong Zhu
2014-12-01 13:20 GMT+08:00
Don't set `spark.akka.frameSize` to 1. The max value of
`spark.akka.frameSize` is 2047. The unit is MB.
Best Regards,
Shixiong Zhu
2014-12-01 0:51 GMT+08:00 Yanbo yanboha...@gmail.com:
Try to use spark-shell --conf spark.akka.frameSize=1
On Dec 1, 2014, at 12:25 AM, Brian Dolan buddha_
What's the status of this application in the yarn web UI?
Best Regards,
Shixiong Zhu
2014-12-05 17:22 GMT+08:00 LinQili lin_q...@outlook.com:
I tried anather test code:
def main(args: Array[String]) {
if (args.length != 1) {
Util.printLog(ERROR, Args error - arg1: BASE_DIR
not send it back to the
client.
spark-submit will return 1 when Yarn reports the ApplicationMaster failed.
Best Regards,
Shixiong Zhu
2014-12-06 1:59 GMT+08:00 LinQili lin_q...@outlook.com:
You mean the localhost:4040 or the application master web ui?
Sent from my iPhone
On Dec 5, 2014, at 17:26
,
Shixiong Zhu
2014-12-10 20:13 GMT+08:00 Johannes Simon johannes.si...@mail.de:
Hi!
I have been using spark a lot recently and it's been running really well
and fast, but now when I increase the data size, it's starting to run into
problems:
I have an RDD in the form of (String, Iterable[String
Good catch. `Join` should use `Iterator`, too. I opened a JIRA here:
https://issues.apache.org/jira/browse/SPARK-4824
Best Regards,
Shixiong Zhu
2014-12-10 21:35 GMT+08:00 Johannes Simon johannes.si...@mail.de:
Hi!
Using an iterator solved the problem! I've been chewing on this for days,
so
Just pointing out a bug in your code: you should not use `mapPartitions` like
that. For details, I recommend the setup() and cleanup() section in Sean
Owen's post:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
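For illustration, a rough sketch of the per-partition setup pattern described there (assuming `lines` is an RDD[String] of "yyyy-MM-dd" dates):
import java.text.SimpleDateFormat

val timestamps = lines.mapPartitions { iter =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd")   // created once per partition
  iter.map(line => fmt.parse(line).getTime)      // applied lazily per record
}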
Best Regards,
Shixiong Zhu
2014-12-14 16:35 GMT+08
Could you post the stack trace?
Best Regards,
Shixiong Zhu
2014-12-16 23:21 GMT+08:00 richiesgr richie...@gmail.com:
Hi
This time I need expert.
On 1.1.1 and only in cluster (standalone or EC2)
when I use this code :
countersPublishers.foreachRDD(rdd => {
rdd.foreachPartition
@Rui do you mean the spark-core jar in the Maven central repo
is incompatible with the same version of the official pre-built Spark
binary? That's really weird. I thought they should have used the same code.
Best Regards,
Shixiong Zhu
2014-12-18 17:22 GMT+08:00 Sean Owen so...@cloudera.com
Congrats!
A little question about this release: Which commit is this release based
on? v1.2.0 and v1.2.0-rc2 point to different commits in
https://github.com/apache/spark/releases
Best Regards,
Shixiong Zhu
2014-12-19 16:52 GMT+08:00 Patrick Wendell pwend...@gmail.com:
I'm happy
I encountered the following issue when enabling dynamicAllocation. You may
want to take a look at it.
https://issues.apache.org/jira/browse/SPARK-4951
Best Regards,
Shixiong Zhu
2014-12-28 2:07 GMT+08:00 Tsuyoshi OZAWA ozawa.tsuyo...@gmail.com:
Hi Anders,
I faced the same issue as you
The Iterable from cogroup is CompactBuffer, which is already materialized.
It's not a lazy Iterable. So currently Spark cannot handle skewed data where some
key has too many values to fit into memory.
The unit of spark.akka.frameSize is MB. The max value is 2047.
Best Regards,
Shixiong Zhu
2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com:
I am trying to run a spark application with
-Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000
-Dspark.akka.frameSize=1
Could you clarify why you need a 10G akka frame size?
Best Regards,
Shixiong Zhu
2015-02-05 9:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:
The unit of spark.akka.frameSize is MB. The max value is 2047.
Best Regards,
Shixiong Zhu
2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com:
I
It's because you submitted the job from Windows to a Hadoop cluster running
on Linux. Spark does not support that yet. See
https://issues.apache.org/jira/browse/SPARK-1825
Best Regards,
Shixiong Zhu
2015-01-28 17:35 GMT+08:00 Marco marco@gmail.com:
I've created a spark app, which runs fine
It's a bug that is fixed in https://github.com/apache/spark/pull/4258,
but the fix has not been merged yet.
Best Regards,
Shixiong Zhu
2015-02-02 10:08 GMT+08:00 Arun Lists lists.a...@gmail.com:
Here is the relevant snippet of code in my main program
Call `map(_.toList)` to convert `CompactBuffer` to `List`.
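For example, a minimal sketch:
val grouped = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).groupByKey()
val asLists = grouped.mapValues(_.toList)   // CompactBuffer -> List
asLists.collect().foreach(println)          // e.g. (a,List(1, 2)), (b,List(3))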
Best Regards,
Shixiong Zhu
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid:
hi
Take a look at the code here I wrote
https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main
`--jars` accepts a comma-separated list of jars. See the usage about
`--jars`
--jars JARS Comma-separated list of local jars to include on the driver and
executor classpaths.
Best Regards,
Shixiong Zhu
2015-01-08 19:23 GMT+08:00 Guillermo Ortiz konstt2...@gmail.com:
I'm trying to execute
. For
me, I would add -Dhbase.profile=hadoop2 to the build instructions so that
the examples project will use a hadoop2-compatible HBase.
Best Regards,
Shixiong Zhu
2015-01-08 0:30 GMT+08:00 Antony Mayi antonym...@yahoo.com.invalid:
thanks, I found the issue, I was including
/usr/lib/spark/lib
cases are the second one. We set
spark.scheduler.executorTaskBlacklistTime to 30000 to solve such "No
space left on device" errors. So if a task fails in some
executor, it won't be scheduled to the same executor again within 30 seconds.
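For illustration, a sketch of setting it on the SparkConf (the value is in milliseconds; 30000 just mirrors the 30-second example above):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.executorTaskBlacklistTime", "30000")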
Best Regards,
Shixiong Zhu
2015-03-16 17:40 GMT+08:00 Jianshi
.
Best Regards,
Shixiong Zhu
2015-03-13 9:37 GMT+08:00 Soila Pertet Kavulya skavu...@gmail.com:
Does Spark support skewed joins similar to Pig which distributes large
keys over multiple partitions? I tried using the RangePartitioner but
I am still experiencing failures because some keys are too
There is no configuration for it now.
Best Regards,
Shixiong Zhu
2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
There may be firewall rules limiting the ports between the host running Spark
and the Hadoop cluster. In that case, not all ports are allowed.
Can it be a range
LGTM. Could you open a JIRA and send a PR? Thanks.
Best Regards,
Shixiong Zhu
2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
I looked @ the 1.3.0 code and figured where this can be added
In org.apache.spark.deploy.yarn ApplicationMaster.scala:282 is
actorSystem
Could you paste the whole stack trace here?
Best Regards,
Shixiong Zhu
2015-03-31 2:26 GMT+08:00 sparkdi shopaddr1...@dubna.us:
I have the same problem, i.e. exception with the same call stack when I
start
either pyspark or spark-shell. I use spark-1.3.0-bin-hadoop2.4 on ubuntu
14.10.
bin
Thanks for the log. It's really helpful. I created a JIRA to explain why it
will happen: https://issues.apache.org/jira/browse/SPARK-6640
However, does this error always happen in your environment?
Best Regards,
Shixiong Zhu
2015-03-31 22:36 GMT+08:00 sparkdi shopaddr1...@dubna.us
RDD is not thread-safe. You should not use it in multiple threads.
Best Regards,
Shixiong Zhu
2015-02-27 23:14 GMT+08:00 rok rokros...@gmail.com:
I'm seeing this java.util.NoSuchElementException: key not found: exception
pop up sometimes when I run operations on an RDD from multiple threads
RDD.foreach runs in the executors. You should use `collect` to fetch data
to the driver. E.g.,
myRdd.collect().foreach { node =>
  mp(node) = 1
}
Best Regards,
Shixiong Zhu
2015-02-25 4:00 GMT+08:00 Vijayasarathy Kannan kvi...@vt.edu:
Thanks, but it still doesn't seem
It's a random port to avoid port conflicts, since multiple AMs can run in
the same machine. Why do you need a fixed port?
Best Regards,
Shixiong Zhu
2015-03-26 6:49 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
Spark 1.3, Hadoop 2.5, Kerberos
When running spark-shell in yarn client mode
it from Eclipse on local[*].
On Sun, Apr 19, 2015 at 7:57 PM, Praveen Balaji
secondorderpolynom...@gmail.com wrote:
Thanks Shixiong. I'll try this.
On Sun, Apr 19, 2015, 7:36 PM Shixiong Zhu zsxw...@gmail.com wrote:
The problem is the code you use to test:
sc.parallelize(List(1, 2, 3
The problem is the code you use to test:
sc.parallelize(List(1, 2, 3)).map(throw new
SparkException("test")).collect();
is like the following example:
def foo: Int => Nothing = {
  throw new SparkException("test")
}
sc.parallelize(List(1, 2, 3)).map(foo).collect();
So actually the Spark jobs do not
://spark.apache.org/docs/latest/running-on-yarn.html
Best Regards,
Shixiong Zhu
2015-04-30 1:00 GMT-07:00 xiaohe lan zombiexco...@gmail.com:
Hi Madhvi,
If I only install Spark on one node, and use spark-submit to run an
application, which are the worker nodes? And where are the executors?
Thanks,
Xiaohe
spark.history.fs.logDirectory is for the history server. For Spark
applications, they should use spark.eventLog.dir. Since you commented out
spark.eventLog.dir, it will be /tmp/spark-events. And this folder does
not exist.
Best Regards,
Shixiong Zhu
2015-04-29 23:22 GMT-07:00 James King jakwebin
The configuration key should be spark.akka.askTimeout for this timeout.
The time unit is seconds.
Best Regards,
Shixiong(Ryan) Zhu
2015-04-26 15:15 GMT-07:00 Deepak Gopalakrishnan dgk...@gmail.com:
Hello,
Just to add a bit more context :
I have done that in the code, but I cannot see it
The history server may need several hours to start if you have a lot of
event logs. Is it stuck, or still replaying logs?
Best Regards,
Shixiong Zhu
2015-05-07 11:03 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
Can you get a jstack for the process? Maybe it's stuck somewhere.
On Thu, May 7
SPARK-5522 is really cool. Didn't notice it.
Best Regards,
Shixiong Zhu
2015-05-07 11:36 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
That shouldn't be true in 1.3 (see SPARK-5522).
On Thu, May 7, 2015 at 11:33 AM, Shixiong Zhu zsxw...@gmail.com wrote:
The history server may need several
You are using Scala 2.11 with 2.10 libraries. You can change
"org.apache.spark" % "spark-streaming_2.10" % "1.3.1"
to
"org.apache.spark" %% "spark-streaming" % "1.3.1"
And sbt will use the corresponding libraries according to your Scala
version.
Best Regards,
Shixiong Zhu
2015-05-06 16:21 GMT-07:00
Could you provide the full driver log? Looks like a bug. Thank you!
Best Regards,
Shixiong Zhu
2015-05-13 14:02 GMT-07:00 Giovanni Paolo Gibilisco gibb...@gmail.com:
Hi,
I'm trying to run an application that uses a Hive context to perform some
queries over JSON files.
The code
I just checked the code that creates OutputCommitCoordinator. Could you
reproduce this issue? If so, could you provide details about how to
reproduce it?
Best Regards,
Shixiong(Ryan) Zhu
2015-04-16 13:27 GMT+08:00 Canoe canoe...@gmail.com:
13119 Exception in thread main
Could you see something like this in the console?
---
Time: 142905487 ms
---
Best Regards,
Shixiong(Ryan) Zhu
2015-04-15 2:11 GMT+08:00 Shushant Arora shushantaror...@gmail.com:
Hi
I am running a spark
: 142905487 ms strings get printed on the console.
No output is getting printed.
And the time interval between two strings of the form (Time: ... ms) is much less
than the streaming duration set in the program.
On Wed, Apr 15, 2015 at 5:11 AM, Shixiong Zhu zsxw...@gmail.com wrote:
Could you see something like
Cleaner
java.lang.NoClassDefFoundError: 0
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)
Best Regards,
Shixiong Zhu
2015-06-03 0:08 GMT+08:00 Ryan Williams ryan.blake.willi...@gmail.com:
I think
the communication
between driver and executors? Because this is ongoing work, there is no
blog post yet. But you can find more details in this umbrella JIRA:
https://issues.apache.org/jira/browse/SPARK-5293
Best Regards,
Shixiong Zhu
2015-06-10 20:33 GMT+08:00 huangzheng 1106944...@qq.com:
Hi all
You should not call `jssc.stop(true);` in a StreamingListener. It will
cause a deadlock: `jssc.stop` won't return until `listenerBus` exits. But
since `jssc.stop` blocks the `StreamingListener`, `listenerBus` cannot exit.
Best Regards,
Shixiong Zhu
2015-06-04 0:39 GMT+08:00 dgoldenberg dgoldenberg
How about other jobs? Is it an executor log, or a driver log? Could you
post other logs near this error, please? Thank you.
Best Regards,
Shixiong Zhu
2015-06-02 17:11 GMT+08:00 Anders Arpteg arp...@spotify.com:
Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that
worked
Before running your script, could you confirm that
/data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar
exists? You might have forgotten to build this jar.
Best Regards,
Shixiong Zhu
2015-07-06 18:14 GMT+08:00 bit1...@163.com bit1...@163.com:
Hi,
I have following
You can set spark.ui.enabled to false to disable the Web UI.
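For example, a minimal sketch (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-ui-example")
  .set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)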
Best Regards,
Shixiong Zhu
2015-07-06 17:05 GMT+08:00 luohui20...@sina.com:
Hello there,
I heard that there is some way to shut down the Spark web UI. Is there a
configuration to support this?
Thank you
`ssc.stop` as the shutdown hook. But stopGracefully
should be false.
Best Regards,
Shixiong Zhu
2015-05-20 21:59 GMT-07:00 Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com:
Thanks Tathagata for making this change..
Dibyendu
On Thu, May 21, 2015 at 8:24 AM, Tathagata Das t
file. Could you convert your
data to String using map and use saveAsTextFile or other save methods?
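For example, a minimal sketch (the RDD and the output path are placeholders):
rdd.map(_.toString).saveAsTextFile("hdfs:///tmp/text-output")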
Best Regards,
Shixiong Zhu
2015-08-14 11:02 GMT+08:00 kale 805654...@qq.com:
Oh, I see. That's the total time of executing a query in Spark. Then the
difference is reasonable, considering Spark has much more work to do, e.g.,
launching tasks in executors.
Best Regards,
Shixiong Zhu
2015-07-26 16:16 GMT+08:00 Louis Hust louis.h...@gmail.com:
Look at the given url
Could you clarify how you measure the Spark time cost? Is it the total time
of running the query? If so, it's possible because the overhead of
Spark dominates for small queries.
Best Regards,
Shixiong Zhu
2015-07-26 15:56 GMT+08:00 Jerrick Hoang jerrickho...@gmail.com:
how big is the dataset
to find similar
issues in the PR build.
Best Regards,
Shixiong Zhu
2015-11-09 18:47 GMT-08:00 Ted Yu <yuzhih...@gmail.com>:
> Created https://github.com/apache/spark/pull/9585
>
> Cheers
>
> On Mon, Nov 9, 2015 at 6:39 PM, Josh Rosen <joshro...@databricks.com>
> wrote
In addition, if you have more than two text files, you can just put them
into a Seq and use "reduce(_ ++ _)".
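For example, a small sketch with hypothetical paths:
val paths = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt", "hdfs:///data/c.txt")
val combined = paths.map(sc.textFile(_)).reduce(_ ++ _)   // union of all the files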
Best Regards,
Shixiong Zhu
2015-11-11 10:21 GMT-08:00 Jakob Odersky <joder...@gmail.com>:
> Hey Jeff,
> Do you mean reading from multiple text files? In that c
You should use `SparkConf.set` rather than `SparkConf.setExecutorEnv`. For
driver configurations, you need to set them before starting your
application; you can pass them with the `--conf` argument of
`spark-submit`.
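For illustration, a sketch of the SparkConf.set approach (the app name and the config key/value are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("conf-example")
  .set("spark.executor.memory", "4g")   // set before the SparkContext is created
val sc = new SparkContext(conf)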
Best Regards,
Shixiong Zhu
2015-11-04 15:55 GMT-08:00 William Li
"trackStateByKey" is about to be added in 1.6 to resolve the performance
issue of "updateStateByKey". You can take a look at
https://issues.apache.org/jira/browse/SPARK-2629 and
https://github.com/apache/spark/pull/9256
Thanks for reporting it Terry. I submitted a PR to fix it:
https://github.com/apache/spark/pull/9132
Best Regards,
Shixiong Zhu
2015-10-15 2:39 GMT+08:00 Reynold Xin <r...@databricks.com>:
> +dev list
>
> On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo <hujie.ea...@gmail.co
Scala 2.10 REPL javap doesn't support Java7 or Java8. It was fixed in Scala
2.11. See https://issues.scala-lang.org/browse/SI-4936
Best Regards,
Shixiong Zhu
2015-10-15 4:19 GMT+08:00 Robert Dodier <robert.dod...@gmail.com>:
> Hi,
>
> I am working with Spark 1.5.1 (o
Which mode are you using? For standalone, it's
org.apache.spark.deploy.worker.Worker. For Yarn and Mesos, Spark just
submits its request to them and they will schedule processes for Spark.
Best Regards,
Shixiong Zhu
2015-10-12 20:12 GMT+08:00 Muhammad Haseeb Javed <11besemja...@seecs.edu
In addition, you cannot turn off JobListener and SQLListener now...
Best Regards,
Shixiong Zhu
2015-10-13 11:59 GMT+08:00 Shixiong Zhu <zsxw...@gmail.com>:
> Is your query very complicated? Could you provide the output of `explain`
> your query that consumes an excessive amou
Could you show how you set the configurations? You need to set these
configurations before creating SparkContext and SQLContext.
Moreover, the history server doesn't support the SQL UI. So
"spark.eventLog.enabled=true" doesn't work for it now.
Best Regards,
Shixiong Zhu
2015-10-13 2:01
You don't need to worry about this sleep. It runs in a separate thread and
usually won't affect the performance of your application.
Best Regards,
Shixiong Zhu
2015-10-09 6:03 GMT+08:00 yael aharon <yael.aharo...@gmail.com>:
> Hello,
> I am working on improving the performance
Is your query very complicated? Could you provide the output of `explain` for
your query that consumes an excessive amount of memory? If this is a small
query, there may be a bug that leaks memory in SQLListener.
Best Regards,
Shixiong Zhu
2015-10-13 11:44 GMT+08:00 Nicholas Pritchard
Each ReceiverInputDStream will create one Receiver. If you only use
one ReceiverInputDStream, there will be only one Receiver in the cluster.
But if you create multiple ReceiverInputDStreams, there will be multiple
Receivers.
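For example, a rough sketch (host and port are placeholders):
val streams = (1 to 3).map(_ => ssc.socketTextStream("localhost", 9999))  // 3 receivers
val unified = ssc.union(streams)                                          // one DStream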
Best Regards,
Shixiong Zhu
2015-10-12 23:47 GMT+08:00 Something
Could you print the content of RDD to check if there are multiple values
for a key in a batch?
Best Regards,
Shixiong Zhu
2015-10-12 18:25 GMT+08:00 Sathiskumar <sathish.palaniap...@gmail.com>:
> I'm running a Spark Streaming application every 10 seconds; its job is
> to
> co
DStream must be Serializable for metadata checkpointing. But you can use
KryoSerializer for data checkpointing. Data checkpointing uses
RDD.checkpoint, whose serializer can be set via spark.serializer.
Best Regards,
Shixiong Zhu
2015-07-08 3:43 GMT+08:00 Chen Song chen.song...@gmail.com:
In Spark
MemoryStore.ensureFreeSpace for details.
Best Regards,
Shixiong Zhu
2015-07-09 19:17 GMT+08:00 Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com:
Hi ,
Just would like to clarify a few doubts I have about how BlockManager behaves.
This is mostly in regards to Spark Streaming Context.
There are two
val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
r2.persist(StorageLevel.MEMORY_ONLY)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)
See
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
Best Regards,
Shixiong Zhu
Hao,
I can reproduce it using the master branch. I'm curious why you cannot
reproduce it. Did you check if the input HadoopRDD did have two partitions?
My test code is
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
Best Regards,
Shixiong Zhu
2015-08-25 13:01
/org/apache/spark/sql/execution/SparkPlan.scala#L185
Best Regards,
Shixiong Zhu
2015-08-25 8:11 GMT+08:00 Jeff Zhang zjf...@gmail.com:
Hi Cheng,
I know that sqlContext.read will trigger one Spark job to infer the
schema. What I mean is that DataFrame#show costs 2 Spark jobs. So overall it
would cost
That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this
case.
Best Regards,
Shixiong Zhu
2015-08-25 14:01 GMT+08:00 Cheng, Hao hao.ch...@intel.com:
Oh, sorry, I missed reading your reply!
I know the minimum number of tasks will be 2 for scanning, but Jeff is talking about
2 jobs
The folder is in "/tmp" by default. Could you use "df -h" to check the free
space of /tmp?
Best Regards,
Shixiong Zhu
2015-09-05 9:50 GMT+08:00 shenyan zhen <shenya...@gmail.com>:
> Has anyone seen this error? Not sure which dir the program was trying to
> write
(i)
i1.readObject()
Could you provide the "explain" output? It would be helpful to find the
circular references.
Best Regards,
Shixiong Zhu
2015-09-05 0:26 GMT+08:00 Jeff Jones <jjo...@adaptivebiotech.com>:
> We are using Scala 2.11 for a driver program that is running
I mean JavaSparkContext.setLogLevel. You can use it like this:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));
jssc.sparkContext().setLogLevel(...);
Best Regards,
Shixiong Zhu
2015-09-29 22:07 GMT+08:00 Ashish Soni <asoni.le...@gmail.com>:
> I
You can use JavaSparkContext.setLogLevel to set the log level in your code.
Best Regards,
Shixiong Zhu
2015-09-28 22:55 GMT+08:00 Ashish Soni <asoni.le...@gmail.com>:
> I am not running it using spark-submit; I am running it locally inside the
> Eclipse IDE. How do I set this usi
Right, you can use SparkContext and SQLContext in multiple threads. They
are thread safe.
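For illustration, a small sketch of submitting two independent jobs concurrently against the same SparkContext:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val f1 = Future { sc.parallelize(1 to 1000).sum() }    // runs on one thread
val f2 = Future { sc.parallelize(1 to 1000).count() }  // runs on another thread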
Best Regards,
Shixiong Zhu
2015-10-01 4:57 GMT+08:00 <saif.a.ell...@wellsfargo.com>:
> Hi all,
>
> I have a process where I do some calculations on each one of the columns
> of a datafram
Do you have the log? It looks like some exception in your code caused the
SparkContext to stop.
Best Regards,
Shixiong Zhu
2015-09-30 17:30 GMT+08:00 tranan <tra...@gmail.com>:
> Hello All,
>
> I have several Spark Streaming applications running on Standalone mode in
> Spark 1.5.
Do you have the log file? It may be because of wrong settings.
Best Regards,
Shixiong Zhu
2015-10-01 7:32 GMT+08:00 markluk <m...@juicero.com>:
> I setup a new Spark cluster. My worker node is dying with the following
> exception.
>
> Caused by: java.util.concurrent.Timeout