Re: Unresolved Attributes

2014-11-09 Thread Srinivas Chamarthi
OK, I am answering my own question here. It looks like name is a reserved keyword or gets some special treatment; unless you use an alias, it doesn't work, so always use an alias with the name attribute. select a.name from xxx a where a. = 'y' // RIGHT; select name from where t = 'yy' // doesn't work
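
A minimal PySpark sketch of the alias workaround described above, assuming the Spark 1.x-era SQL API (inferSchema/registerTempTable); the table, column values, and filter are made up for illustration:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local[*]", "name-alias-example")
    sqlContext = SQLContext(sc)

    # Hypothetical data; the only point is that the 'name' column is always
    # referenced through the table alias 'p'.
    people = sc.parallelize([Row(name="alice", dept="x"), Row(name="bob", dept="y")])
    sqlContext.inferSchema(people).registerTempTable("people")

    # Qualifying 'name' with the alias avoids the unresolved-attribute error.
    print(sqlContext.sql("SELECT p.name FROM people p WHERE p.dept = 'y'").collect())
    sc.stop()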

Why does this simple Spark program use only one core?

2014-11-09 Thread ReticulatedPython
So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following: spark-submit --master local[*] pi.py And the code of that program is the following. When I use top to see CPU consumption, only 1 core is being utilized. Why is that? Secondly, spark

Re: Does Spark work on multicore systems?

2014-11-09 Thread Sonal Goyal
Also, the level of parallelism would be affected by how big your input is. Could this be a problem in your case? On Sunday, November 9, 2014, Aaron Davidson ilike...@gmail.com wrote: oops, meant to cc userlist too On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson ilike...@gmail.com

Re: Does Spark work on multicore systems?

2014-11-09 Thread Akhil Das
Try adding the following entry inside your conf/spark-defaults.conf file spark.cores.max 64 Thanks Best Regards On Sun, Nov 9, 2014 at 3:50 AM, Blind Faith person.of.b...@gmail.com wrote: I am a Spark newbie and I use python (pyspark). I am trying to run a program on a 64 core system, but no

Re: supported sql functions

2014-11-09 Thread Nicholas Chammas
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features On Sunday, November 9, 2014, Srinivas Chamarthi srinivas.chamar...@gmail.com wrote: can anyone point me to documentation on supported SQL functions? I am trying to do a contains operation on a SQL array type. But I
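
If the array column is queried through a HiveContext, the standard Hive UDF array_contains should be covered by the supported-Hive-features list linked above. A small sketch, assuming a Spark build with Hive support and the 1.x-era inferSchema/registerTempTable API; the data is made up:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext, Row

    sc = SparkContext("local[*]", "array-contains-example")
    hive = HiveContext(sc)   # requires a Hive-enabled Spark build

    # Hypothetical rows with an array-typed column.
    items = sc.parallelize([Row(id=1, tags=["a", "b"]), Row(id=2, tags=["c"])])
    hive.inferSchema(items).registerTempTable("items")

    # array_contains is a standard Hive UDF.
    print(hive.sql("SELECT id FROM items WHERE array_contains(tags, 'a')").collect())
    sc.stop()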

Re: Why does this simple Spark program use only one core?

2014-11-09 Thread Akhil Das
You can set the following entry inside the conf/spark-defaults.conf file: spark.cores.max 16 If you want to read the default level of parallelism, you can use the following API call: sc.defaultParallelism, where sc is your SparkContext object. Thanks Best Regards On Sun, Nov 9, 2014 at 6:48 PM,
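
A short PySpark sketch of both suggestions; the master URL and core count are placeholders, and spark.cores.max takes effect when running against a standalone or Mesos cluster rather than in local mode:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[*]")            # placeholder; usually supplied by spark-submit
            .setAppName("cores-example")
            .set("spark.cores.max", "16"))    # cap on cores when on a standalone/Mesos cluster
    sc = SparkContext(conf=conf)

    # Default number of tasks Spark schedules for operations like parallelize;
    # with local[*] this typically equals the number of cores.
    print(sc.defaultParallelism)
    sc.stop()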

Re: spark context not defined

2014-11-09 Thread Akhil Das
If you are talking about a standalone program, have a look at this doc: https://spark.apache.org/docs/0.9.1/python-programming-guide.html#standalone-programs from pyspark import SparkConf, SparkContext from pyspark.sql import HiveContext conf = (SparkConf() .setMaster(local)
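
A complete sketch of the standalone-program pattern from the linked guide, which also explains the "spark context not defined" symptom: sc is only predefined in the pyspark shell, so a standalone script has to create it itself. The app name and master are placeholders, and HiveContext needs a Spark build with Hive support:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = (SparkConf()
            .setMaster("local[*]")
            .setAppName("standalone-example"))
    sc = SparkContext(conf=conf)    # created explicitly; not predefined outside the shell
    hive = HiveContext(sc)          # requires a Hive-enabled Spark build

    print(sc.parallelize(range(10)).sum())
    sc.stop()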

Re: spark-submit inside script... need some bash help

2014-11-09 Thread Akhil Das
Not sure why that is failing, but I found a workaround like: #!/bin/bash -e SPARK_SUBMIT=/home/akhld/mobi/localcluster/spark-1/bin/spark-submit export _JAVA_OPTIONS=-Xmx1g OPTS+=" --class org.apache.spark.examples.SparkPi" echo $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar

Re: Why does this simple Spark program use only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of partitions. You can also specify it when doing parallelize, e.g. rdd = sc.parallelize(xrange(1000), 10) This should run in parallel if you have multiple partitions and cores, but it might be that during part of the
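
A short sketch of both checks in PySpark; the element count and partition count are arbitrary:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "partition-check")

    rdd = sc.parallelize(range(1000), 16)   # second argument = number of partitions
    print(rdd.getNumPartitions())           # should print 16

    # With multiple partitions and multiple cores, this runs in parallel.
    print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))
    sc.stop()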

Re: netty on classpath when using spark-submit

2014-11-09 Thread Tobias Pfeiffer
Hi, On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote: On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote: From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst Thank you very much, I didn't know
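
For reference, a sketch of turning on the experimental property mentioned above so that user-supplied jars take precedence over Spark's own on the executors; the master URL is a placeholder, and the setting is marked experimental on the configuration page:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[*]")      # placeholder; usually supplied by spark-submit
            .setAppName("user-classpath-first")
            .set("spark.files.userClassPathFirst", "true"))
    sc = SparkContext(conf=conf)
    sc.stop()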

Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei
Thanks for your reply! According to your hint, the code should be like this: // i want to save data in rdd to mongodb and hdfs rdd.saveAsNewAPIHadoopFile() rdd.saveAsTextFile() But will the application read HDFS twice? qinwei From: Akhil Das Date: 2014-11-07
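
The answer hinges on caching: each output action recomputes the RDD's lineage unless the RDD is persisted. A sketch of the pattern, with the MongoDB write stood in for by a second saveAsTextFile (the connector setup is outside this sketch) and hypothetical paths:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "two-output-actions")
    rdd = sc.textFile("hdfs:///input/data")          # hypothetical input path

    # Without this persist(), each save below would re-read the HDFS input.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.saveAsTextFile("hdfs:///output/for-mongo")   # stand-in for the MongoDB write
    rdd.saveAsTextFile("hdfs:///output/for-hdfs")
    rdd.unpersist()
    sc.stop()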

Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei
Thanks for your reply! As you mentioned, the insert clause was not executed because the results of args.map are never used anywhere; after I modified the code, it works. qinwei From: Tobias Pfeiffer Date: 2014-11-07 18:04 To: qinwei CC: user Subject: Re: about write mongodb in
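
The underlying point is that transformations like map/mapPartitions are lazy; side effects inside them only run if an action forces them. A sketch of the usual pattern for per-partition writes, with the actual sink left hypothetical:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "side-effects-per-partition")
    rdd = sc.parallelize(range(100), 4)

    def write_partition(records):
        # Hypothetical sink: the real code would open one MongoDB connection
        # per partition here, insert each record, then close the connection.
        for record in records:
            pass

    # foreachPartition is an action, so the writes actually run; a bare
    # mapPartitions(...) whose result is never used stays lazy and its
    # inserts never execute, which is what happened above.
    rdd.foreachPartition(write_partition)
    sc.stop()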

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-09 Thread santon
Sorry for the delay. I'll try to add some more details on Monday. Unfortunately, I don't have a script to reproduce the error. Actually, it seemed to be more about the data set than the script. The same code on different data sets led to different results; only larger data sets on the order of

Repartition to data-size per partition

2014-11-09 Thread Harry Brundage
I want to avoid the small files problem when using Spark, without having to manually calibrate a `repartition` at the end of each Spark application I am writing, since the amount of data passing through sadly isn't all that predictable. We're picking up from and writing data to HDFS. I know other
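
One possible approach (a sketch, not a built-in Spark feature): estimate the data size from a sample and derive the partition count from a target bytes-per-partition, so the repartition number does not have to be hand-tuned per run. Paths and the target size are placeholders:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "size-based-repartition")

    TARGET_PARTITION_BYTES = 128 * 1024 * 1024    # placeholder: roughly one HDFS block

    rdd = sc.textFile("hdfs:///input/data")       # hypothetical input path

    # Rough size estimate: average record length over a small sample times the
    # record count. This is only an approximation of the on-disk size.
    sample = rdd.take(1000)
    avg_record_bytes = max(1, sum(len(s) for s in sample) / max(1, len(sample)))
    estimated_bytes = avg_record_bytes * rdd.count()

    num_partitions = max(1, int(estimated_bytes / TARGET_PARTITION_BYTES))
    rdd.repartition(num_partitions).saveAsTextFile("hdfs:///output/data")
    sc.stop()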

Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-11-09 Thread thanhtien522
Yeah, it works. I turned off the firewall on my Windows machine and it worked. Thanks so much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-tp16989p18452.html Sent from the Apache Spark User List

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-09 Thread lev
I set the path of commons-math3-3.1.1.jar in spark.executor.extraClassPath and it worked. Thanks a lot! It only worked for me when the jar was locally on the machine. Is there a way to make it work when the jar is on HDFS? I tried pointing it at the file on HDFS (with or without
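
For reference, a sketch of the setting that worked; spark.executor.extraClassPath entries are plain local filesystem paths that must already exist on every worker, which matches the observation that an HDFS location is not picked up here. The path and master are placeholders:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[*]")    # placeholder; usually supplied by spark-submit
            .setAppName("extra-classpath-example")
            .set("spark.executor.extraClassPath",
                 "/opt/libs/commons-math3-3.1.1.jar"))   # must exist on each executor host
    sc = SparkContext(conf=conf)
    sc.stop()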

embedded spark for unit testing..

2014-11-09 Thread Kevin Burton
What’s the best way to embed Spark to run local mode in unit tests? Some of our jobs are mildly complex and I want to keep verifying that they work, including during schema changes / migration. I think for some of this I would just run local mode, read from a few text files via resources, and

Queues

2014-11-09 Thread Deep Pradhan
Has anyone implemented Queues using RDDs? Thank You

Rdd replication

2014-11-09 Thread rapelly kartheek
Hi, I am trying to understand the RDD replication code. In the process, I frequently execute one Spark application whenever I make a change to the code, to see the effect. My problem is that after a set of repeated executions of the same application, I find that my cluster behaves unusually. Ideally, when

Re: Does Spark work on multicore systems?

2014-11-09 Thread lalit1303
While creating the SparkConf, set spark.cores.max to the maximum number of cores to be used by the Spark job. By default it is set to 1. - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context:

Re: embedded spark for unit testing..

2014-11-09 Thread DB Tsai
You can write unit tests with a local Spark context by mixing in the LocalSparkContext trait. See https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
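
The linked suite uses the Scala LocalSparkContext trait; a roughly analogous pattern in PySpark (a sketch, not a helper that ships with Spark) is to create a local SparkContext in setUp and stop it in tearDown:

    import unittest
    from pyspark import SparkContext

    class MyJobTest(unittest.TestCase):
        def setUp(self):
            self.sc = SparkContext("local[2]", "unit-test")

        def tearDown(self):
            self.sc.stop()

        def test_word_count(self):
            counts = (self.sc.parallelize(["a b", "a"])
                      .flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b)
                      .collectAsMap())
            self.assertEqual(counts["a"], 2)

    if __name__ == "__main__":
        unittest.main()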

Efficient Key Structure in pairRDD

2014-11-09 Thread nsareen
Hi, We are trying to adopt Spark for our application. We have an analytical application which stores data in star schemas (SQL Server). All the cubes are loaded into a key/value structure and saved in Trove (an in-memory collection). Here the key is a short array where each short number
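
If this moves to a PySpark pair RDD, one sketch of how such composite keys are usually modelled: Python lists are not hashable, so a compact key becomes a tuple of ints. The dimension ids and measures below are made up:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "tuple-key-example")

    # (dim1_id, dim2_id, dim3_id) -> measure
    cells = sc.parallelize([((1, 4, 2), 10.0), ((1, 4, 2), 5.0), ((2, 7, 3), 1.5)])

    totals = cells.reduceByKey(lambda a, b: a + b)
    print(totals.collectAsMap())
    sc.stop()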

canopy clustering

2014-11-09 Thread aminn_524
I want to run the k-means of MLlib on a big dataset; it seems that for big datasets we need to perform pre-clustering methods such as canopy clustering. By starting with an initial clustering, the number of more expensive distance measurements can be significantly reduced by ignoring points outside of the
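
For reference, a minimal MLlib k-means call in PySpark; the input path, parsing, and k are placeholders, and whether a canopy-style pre-clustering step is needed on top of this is exactly the open question above:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "kmeans-example")

    # Hypothetical input: one whitespace-separated feature vector per line.
    data = sc.textFile("hdfs:///input/points").map(
        lambda line: [float(x) for x in line.split()])

    model = KMeans.train(data, k=10, maxIterations=20)
    print(model.clusterCenters)
    sc.stop()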