Hi, you can refer to https://issues.apache.org/jira/browse/SPARK-14083 for
more detail.
For the performance issue, it is better to use the DataFrame API than the Dataset API.
On Sat, Feb 25, 2017 at 2:45 AM, Jacek Laskowski wrote:
> Hi Justin,
>
> I have never seen such a list. I think
Hi, John:
I am very interested in your experiment. How did you find that RDD
serialization costs a lot of time? From the logs, or from some other tools?
On Fri, Mar 11, 2016 at 8:46 PM, John Lilley
wrote:
> Andrew,
>
>
>
> We conducted some tests for using Graphx to solve
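On measuring serialization cost: the Spark web UI reports per-task metrics such as "Task Deserialization Time" and "Result Serialization Time" under the additional metrics in the stage view. Outside Spark, the raw cost of Java serialization can also be timed directly. A minimal sketch, with made-up sample data:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Time how long plain Java serialization takes for a given object.
// This mirrors the cost Spark pays when shipping closures/records;
// inside Spark the same cost shows up as task-level UI metrics.
def timeSerialization(obj: AnyRef): (Long, Long) = {
  val start = System.nanoTime()
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  val elapsedMicros = (System.nanoTime() - start) / 1000
  (elapsedMicros, bytes.size.toLong)
}

// Example: serialize a large array of boxed integers
val data: Array[java.lang.Integer] = Array.tabulate(100000)(Int.box)
val (micros, size) = timeSerialization(data)
println(s"serialized $size bytes in $micros us")
```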
Hi, All:
I modified the Spark code and tried to use some extra jars in Spark; the
extra jars are published in my local Maven repository using *mvn install*.
However, sbt cannot find these jar files, even though I can find them
under */home/myname/.m2/repository*.
I can guarantee
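A likely cause: by default sbt does not resolve artifacts from the local Maven repository that *mvn install* publishes to; it has to be added as a resolver. A minimal build.sbt sketch (the dependency coordinates below are placeholders):

```scala
// build.sbt — make sbt search the local Maven repository
// (~/.m2/repository) where `mvn install` puts artifacts.
resolvers += Resolver.mavenLocal

// hypothetical coordinates of the locally installed jar
libraryDependencies += "com.example" %% "my-extra-lib" % "0.1.0-SNAPSHOT"
```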
. -Xiangrui
On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com wrote:
I just want to make the best use of the CPU, and test the performance of
Spark when there are a lot of tasks on a single node.
On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote:
Good, worth double
I try to use multiple threads to run Spark SQL queries.
Some sample code, just like this:
val sqlContext = new SQLContext(sc)
val rdd_query = sc.parallelize(data, part)
rdd_query.registerTempTable("MyTable")
sqlContext.cacheTable("MyTable")
val serverPool = Executors.newFixedThreadPool(3)
val
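The thread-pool fan-out itself can be sketched without Spark; in the real program each submitted task would wrap one sqlContext.sql(...).collect() call. The query strings below are placeholders:

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// Plain-Scala sketch of the fan-out pattern: a fixed pool runs
// several independent "queries" concurrently, then we collect results.
val pool = Executors.newFixedThreadPool(3)
val queries = Seq("q1", "q2", "q3", "q4")

val futures = queries.map { q =>
  pool.submit(new Callable[String] {
    // In the real program this body would be sqlContext.sql(q).collect()
    override def call(): String = s"result-of-$q"
  })
}

val results = futures.map(_.get()) // blocks until each task finishes
pool.shutdown()
pool.awaitTermination(10, TimeUnit.SECONDS)
println(results)
```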
executors for more information.
have 24 cores?
On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote:
I gave 50GB to the executor, so it seems that there is no reason the
memory is not enough.
On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen so...@cloudera.com wrote:
Meaning, you have 128GB per machine but how much
Hi,
I run kmeans (MLlib) on a cluster with 12 workers. Every worker owns
128GB RAM and 24 cores. I run 48 tasks on one machine; the total data is just
40GB.
When the dimension of the data set is about 10^7, the duration of every
task is about 30s, but the cost of GC is about 20s.
When I
and much more
thoroughly tested version under the property
spark.shuffle.blockTransferService,
which is set to netty by default.
On Tue, Jan 13, 2015 at 9:26 PM, lihu lihu...@gmail.com wrote:
Hi,
I just test groupByKey method on a 100GB data, the cluster is 20
machine, each with 125GB RAM
Hi,
I just tested the groupByKey method on 100GB of data; the cluster is 20
machines, each with 125GB RAM.
At first I set conf.set("spark.shuffle.use.netty", "false") and ran
the experiment, and then I set conf.set("spark.shuffle.use.netty", "true")
and re-ran the experiment, but at the latter
There is no way to avoid the shuffle if you use combine-by-key, no matter
whether your data is cached in memory, because the shuffle write must write
the data to disk. And it seems that Spark cannot guarantee that the same
key (K1) goes to Container_X.
You can use tmpfs for your shuffle dir; this
By the way, I am not sure whether the shuffled key can go into the
same container.
How about your scenario? Do you need to use lots of broadcasts? If not, it
would be better to focus more on other things.
At this time, there is no better method than TorrentBroadcast. Though it
transfers block by block, once a node gets the data it can act as a data
source immediately.
Can this assembly get faster if we do not need Spark SQL or some other
components in Spark? For example, if we only need the core of Spark.
An RDD is just a wrapper around a Scala collection. Maybe you can use the
.collect() method to get the Scala collection, and then transform it
into a JSON object using a Scala method.
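For illustration, a hand-rolled conversion of a collected pair collection into a JSON string; this sketch assumes (String, Int) records, and in practice a JSON library such as json4s or Jackson would be the usual choice:

```scala
// Convert a collected sequence of (key, value) pairs into a JSON array
// string. `collect()` on an RDD returns exactly this kind of local
// Scala collection.
def toJson(pairs: Seq[(String, Int)]): String =
  pairs
    .map { case (k, v) => s"""{"key":"$k","value":$v}""" }
    .mkString("[", ",", "]")

val collected = Seq(("a", 1), ("b", 2)) // stand-in for rdd.collect()
println(toJson(collected))
```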
*Institute for Interdisciplinary Information Sciences (IIIS),
http://iiis.tsinghua.edu.cn/*
*Tsinghua University, China*
*Email: lihu...@gmail.com*
*Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/*
Hi,
The Spark assembly is time-costly. If I only need
spark-assembly-1.1.0-hadoop2.3.0.jar and do not need
spark-examples-1.1.0-hadoop2.3.0.jar, how do I configure Spark to
avoid assembling the examples jar? I know the *export
SPARK_PREPEND_CLASSES=true* method
can reduce the assembly time, but
Matei, sorry for my last typo. And the tip can save about 30s on
my machine.
On Wed, Nov 26, 2014 at 3:34 PM, lihu lihu...@gmail.com wrote:
Matei, thank you very much!
After taking your advice, the time for the assembly went from about 20 min
down to 6 min on my machine. That's a very big
Which code did you use? Is it caused by your own code or by something in
Spark itself?
On Tue, Jul 22, 2014 at 8:50 AM, hsy...@gmail.com hsy...@gmail.com wrote:
I have the same problem
On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote:
Hi,
Everyone. I have a piece
Hi,
Everyone. I have the following piece of code. When I run it,
the error below occurred; it seems that the SparkContext is not
serializable, but I do not try to use the SparkContext except for the
broadcast.
[In fact, this code is in MLlib; I just try to broadcast the
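The usual cause of such an error is that the closure captures the whole non-serializable object (here a stand-in for SparkContext) when only a serializable field is actually needed. A minimal sketch reproducing both cases with plain Java serialization; all class names here are illustrative:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext: a class that is NOT Serializable.
class Context { val broadcastValue: Int = 42 }

// A "closure" that captures the whole context object...
class BadTask(ctx: Context) extends Serializable {
  def run(x: Int): Int = x + ctx.broadcastValue // retains `ctx` as a field
}

// ...versus one that copies only the serializable field it needs.
class GoodTask(ctx: Context) extends Serializable {
  private val value = ctx.broadcastValue // captured as a plain Int
  def run(x: Int): Int = x + value
}

def serializes(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }

val ctx = new Context
println(serializes(new BadTask(ctx)))  // false: drags in the Context
println(serializes(new GoodTask(ctx))) // true: only the Int is captured
```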
I see that a task will be either a ShuffleMapTask or a ResultTask. I
wonder which function will generate a ShuffleMapTask, and which will generate
a ResultTask?
Hi,
I set up a small cluster with 3 machines; every machine has 64GB RAM and
11 cores, and I use Spark 0.9.
I have set spark-env.sh as follows:
SPARK_MASTER_IP=192.168.35.2
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=12306
SPARK_WORKER_CORES=3
Hi,
I just ran a simple example to generate some data for the ALS
algorithm. My Spark version is 0.9, in local mode, and the memory of my
node is 108GB.
But when I set conf.set("spark.akka.frameSize", "4096"), the
following problem occurred, and when I do not set this, it runs
well.
Thanks, but I do not want to log my own program's info; I just do not want
Spark to output all the info to my console. I want Spark to output the log
into some file which I specify.
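One common approach, sketched here assuming the stock log4j 1.x setup that Spark of that era shipped with: copy conf/log4j.properties.template to conf/log4j.properties and point the root category at a file appender instead of the console. The file path below is a placeholder:

```properties
# conf/log4j.properties — send Spark's own logging to a file
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```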
On Tue, Mar 11, 2014 at 11:49 AM, Robin Cjc cjcro...@gmail.com wrote:
Hi lihu,
you can extend