OK, I am answering my own question here. It looks like "name" is a reserved
keyword or gets some special treatment. Unless you use an alias it doesn't
work, so always use an alias with the name attribute.
select a.name from xxx a where a.name = 'y' // RIGHT
select name from xxx where name = 'yy' // doesn't work
So, I'm running this simple program on a 16-core system. I run it by issuing
the following:
spark-submit --master local[*] pi.py
And the code of that program is the following. When I use top to check CPU
consumption, only one core is being utilized. Why is that? Secondly, Spark
Also, the level of parallelism would be affected by how big your input is.
Could this be a problem in your case?
On Sunday, November 9, 2014, Aaron Davidson ilike...@gmail.com wrote:
oops, meant to cc userlist too
On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson ilike...@gmail.com
Try adding the following entry inside your conf/spark-defaults.conf file
spark.cores.max 64
Thanks
Best Regards
On Sun, Nov 9, 2014 at 3:50 AM, Blind Faith person.of.b...@gmail.com
wrote:
I am a Spark newbie and I use Python (PySpark). I am trying to run a
program on a 64-core system, but no
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
On Sunday, November 9, 2014, Srinivas Chamarthi srinivas.chamar...@gmail.com wrote:
Can anyone point me to documentation on supported SQL functions? I am
trying to do a contains operation on a SQL array type. But I
You can set the following entry inside the conf/spark-defaults.conf file
spark.cores.max 16
If you want to read the default value, you can use the following API call:
sc.defaultParallelism
where sc is your SparkContext object.
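For reference, a minimal PySpark sketch combining the two suggestions above
(setting spark.cores.max programmatically and reading the default parallelism
back); the app name and core count are illustrative, not from the thread:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("parallelism-check")   # illustrative app name
        .set("spark.cores.max", "16"))     # same effect as the spark-defaults.conf entry
sc = SparkContext(conf=conf)

print(sc.defaultParallelism)               # read back the default level of parallelism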
Thanks
Best Regards
On Sun, Nov 9, 2014 at 6:48 PM,
If you are talking about a standalone program, have a look at this doc.
https://spark.apache.org/docs/0.9.1/python-programming-guide.html#standalone-programs
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setMaster("local[*]")      # master must be a quoted string; "local[*]" uses all cores
        .setAppName("MyApp"))       # illustrative app name
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)        # only needed if you use the Hive/SQL features imported above
Not sure why that is failing, but I found a workaround like:
#!/bin/bash -e
SPARK_SUBMIT=/home/akhld/mobi/localcluster/spark-1/bin/spark-submit
export _JAVA_OPTIONS="-Xmx1g"
OPTS+=" --class org.apache.spark.examples.SparkPi"
echo $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar
Call getNumPartitions() on your RDD to make sure it has the right number of
partitions. You can also specify it when doing parallelize, e.g.
rdd = sc.parallelize(xrange(1000), 10)
This should run in parallel if you have multiple partitions and cores, but it
might be that during part of the
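As a rough illustration of the above (the partition and core counts are just
examples, not from the thread):

rdd = sc.parallelize(xrange(1000), 10)
print(rdd.getNumPartitions())    # 10
rdd = rdd.repartition(16)        # e.g. match the 16 cores mentioned earlier
print(rdd.getNumPartitions())    # 16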
Hi,
On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote:
On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote:
From http://spark.apache.org/docs/latest/configuration.html it seems
that there is an experimental property:
spark.files.userClassPathFirst
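A hedged sketch of switching that property on from PySpark; it could equally
be set in conf/spark-defaults.conf or passed with --conf to spark-submit:

from pyspark import SparkConf

conf = SparkConf().set("spark.files.userClassPathFirst", "true")  # experimental property mentioned above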
Thank you very much, I didn't know
Thanks for your reply! According to your hint, the code should be like this:
// I want to save the data in the RDD to MongoDB and HDFS
rdd.saveAsNewAPIHadoopFile()
rdd.saveAsTextFile()
But will the application read HDFS twice?
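A hedged sketch of one way to avoid reading HDFS twice, assuming the RDD fits
in memory: cache it before the two save actions. The path below is
illustrative, not from the thread:

rdd.cache()   # the HDFS source is scanned only once, then reused by both writes
# ... write to MongoDB here (for example via saveAsNewAPIHadoopFile with the mongo-hadoop connector) ...
rdd.saveAsTextFile("hdfs:///output/path")   # illustrative path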
qinwei
From: Akhil Das
Date: 2014-11-07
Thanks for your reply! As you mentioned, the insert clause is not executed
because the results of args.map are never used anywhere; after I modified the
code, it works.
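For reference, a hedged sketch of the pattern in PySpark: side effects belong
in an action such as foreachPartition rather than in a lazy transformation like
map. The MongoDB host, database, and collection names are assumptions, not from
this thread:

def write_partition(records):
    # assumption: pymongo is installed and a MongoDB instance is reachable
    from pymongo import MongoClient
    client = MongoClient("localhost", 27017)
    collection = client["mydb"]["mycoll"]
    for record in records:
        collection.insert({"value": record})   # insert() in older pymongo; insert_one() in pymongo 3+
    client.close()

rdd.foreachPartition(write_partition)          # an action, so the writes actually run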
qinwei
From: Tobias Pfeiffer
Date: 2014-11-07 18:04
To: qinwei
CC: user
Subject: Re: about write mongodb in
Sorry for the delay. I'll try to add some more details on Monday.
Unfortunately, I don't have a script to reproduce the error. Actually, it
seemed to be more about the data set than the script. The same code on
different data sets led to different results; only larger data sets on the
order of
I want to avoid the small files problem when using Spark, without having to
manually calibrate a `repartition` at the end of each Spark application I
am writing, since the amount of data passing through sadly isn't all that
predictable. We're picking up from and writing data to HDFS.
I know other
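Not the poster's solution, just a hedged sketch of one common workaround:
derive the partition count from a rough size estimate so it doesn't have to be
hand-tuned per application. The per-record and per-file byte figures are
assumptions:

TARGET_BYTES_PER_FILE = 128 * 1024 * 1024          # aim for roughly one HDFS block per output file

def output_partitions(rdd, bytes_per_record=200):  # bytes_per_record is a rough guess
    estimated_bytes = rdd.count() * bytes_per_record
    return max(1, int(estimated_bytes / TARGET_BYTES_PER_FILE))

# rdd.coalesce(output_partitions(rdd)).saveAsTextFile("hdfs:///out/path")  # illustrative path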
Yeah, it works.
I turned off the firewall on my Windows machine and it works now.
Thanks so much.
I set the path of commons-math3-3.1.1.jar to spark.executor.extraClassPath
and it worked.
Thanks a lot!
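The same setting expressed programmatically, as a hedged sketch; the jar path
is illustrative:

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.extraClassPath", "/opt/jars/commons-math3-3.1.1.jar"))  # illustrative path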
It only worked for me when the jar was stored locally on the machine.
Is there a way to make it work when the jar is on HDFS?
I tried putting a link to the file on HDFS there (with or without
What's the best way to embed Spark running in local mode in unit tests?
Some of our jobs are mildly complex and I want to keep verifying that they
work, including during schema changes / migrations.
I think for some of this I would just run local mode, read from a few text
files via resources, and
Has anyone implemented Queues using RDDs?
Thank You
Hi,
I am trying to understand the RDD replication code. In the process, I
frequently execute one Spark application whenever I make a change to the
code, to see the effect.
My problem is, after a set of repeated executions of the same application,
I find that my cluster behaves unusually.
Ideally, when
While creating the SparkConf, set the variable spark.cores.max to the
maximum number of cores to be used by the Spark job.
By default it is set to 1.
-
Lalit Yadav
la...@sigmoidanalytics.com
You can write unit tests with a local Spark context by mixing in the
LocalSparkContext trait.
See
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
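A rough PySpark analogue of the same idea, as a hedged sketch; it is not part
of the linked suite. Each test class gets its own local SparkContext:

import unittest
from pyspark import SparkContext

class MyJobTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[2]", "unit-tests")   # small local context for tests

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_word_count(self):
        counts = (self.sc.parallelize(["a b", "a"])
                  .flatMap(lambda line: line.split())
                  .countByValue())
        self.assertEqual(counts["a"], 2)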
Hi,
We are trying to adopt Spark for our application.
We have an analytical application which stores data in star schemas (SQL
Server). All the cubes are loaded into a key/value structure and saved in
Trove (an in-memory collection). Here the key is a short array where each short
number
I want to run the k-means of MLlib on a big dataset. It seems that for big
datasets we need to perform pre-clustering methods such as canopy clustering.
By starting with an initial clustering, the number of more expensive distance
measurements can be significantly reduced by ignoring points outside of the
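Not canopy clustering itself, just a hedged sketch of one cheap pre-step: fit
k-means on a small sample first and use that rough model to assign the full
dataset. It assumes data is an RDD of feature vectors; k, the sample fraction,
and the iteration count are illustrative:

from pyspark.mllib.clustering import KMeans

sample = data.sample(False, 0.01, seed=42)                 # small sample of the big dataset
rough_model = KMeans.train(sample, 10, maxIterations=5)    # cheap initial clustering
assignments = data.map(lambda point: rough_model.predict(point))  # assign full data with the rough model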