Re: Unresolved Attributes
OK, I am answering my own question here. It looks like name is a reserved keyword or gets some special treatment: unless you use an alias, it doesn't work, so always use a table alias with the name attribute.

select a.name from xxx a where a.t = 'y'   // works
select name from xxx where t = 'yy'        // doesn't work

Not sure if there's an issue already filed and fixed in master. I will raise an issue if someone else confirms it.

thx
srinivas

On Sat, Nov 8, 2014 at 3:26 PM, Srinivas Chamarthi srinivas.chamar...@gmail.com wrote:
I have an exception when I am trying to run a simple where-clause query. I can see the name attribute is present in the schema, but somehow it still throws the exception.

query = "select name from business where business_id=" + business_id

What am I doing wrong?

thx
srinivas

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'name, tree:
Project ['name]
 Filter (business_id#1 = 'Ba1hXOqb3Yhix8bhE0k_WQ)
  Subquery business
   SparkLogicalPlan (ExistingRdd [attributes#0,business_id#1,categories#2,city#3,full_address#4,hours#5,latitude#6,longitude#7,name#8,neighborhoods#9,open#10,review_count#11,stars#12,state#13,type#14], MappedRDD[5] at map at JsonRDD.scala:38)
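For reference, a minimal pyspark sketch of the alias workaround described above (the table registration and input path are assumptions; the id literal comes from the exception in the thread):

# Sketch of the alias workaround (Spark 1.1-era pyspark API); path and names are assumptions.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "alias-workaround")
sqlContext = SQLContext(sc)

business = sqlContext.jsonFile("hdfs:///data/business.json")   # hypothetical input path
business.registerTempTable("business")

# Qualifying the column through a table alias lets "name" resolve:
rows = sqlContext.sql("SELECT b.name FROM business b WHERE b.business_id = 'Ba1hXOqb3Yhix8bhE0k_WQ'")
print rows.collect()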
Why does this simple Spark program use only one core?
So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following:

spark-submit --master local[*] pi.py

The code of the program is below. When I use top to see CPU consumption, only 1 core is being utilized. Why is that? Secondly, the Spark documentation says that the default parallelism is contained in the property spark.default.parallelism. How can I read this property from within my Python program?

#pi.py
from pyspark import SparkContext
import random

NUM_SAMPLES = 1250

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Re: Does Spark work on multicore systems?
Also, the level of parallelism would be affected by how big your input is. Could this be a problem in your case?

On Sunday, November 9, 2014, Aaron Davidson ilike...@gmail.com wrote:
oops, meant to cc userlist too

On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson ilike...@gmail.com wrote:
The default local master is local[*], which should use all cores on your system. So you should be able to just do ./bin/pyspark and sc.parallelize(range(1000)).count() and see that all your cores were used.

On Sat, Nov 8, 2014 at 2:20 PM, Blind Faith person.of.b...@gmail.com wrote:
I am a Spark newbie and I use Python (pyspark). I am trying to run a program on a 64-core system, but no matter what I do, it always uses 1 core. It doesn't matter if I run it using spark-submit --master local[64] run.sh or call x.repartition(64) in my code on an RDD, the Spark program always uses one core. Does anyone have experience of running Spark programs on multicore processors with success? Can someone provide me a very simple example that properly runs on all cores of a multicore system?

--
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
Re: Does Spark work on multicore systems?
Try adding the following entry inside your conf/spark-defaults.conf file:

spark.cores.max 64

Thanks
Best Regards

On Sun, Nov 9, 2014 at 3:50 AM, Blind Faith person.of.b...@gmail.com wrote:
I am a Spark newbie and I use Python (pyspark). I am trying to run a program on a 64-core system, but no matter what I do, it always uses 1 core. It doesn't matter if I run it using spark-submit --master local[64] run.sh or call x.repartition(64) in my code on an RDD, the Spark program always uses one core. Does anyone have experience of running Spark programs on multicore processors with success? Can someone provide me a very simple example that properly runs on all cores of a multicore system?
Re: supported sql functions
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

On Sunday, November 9, 2014, Srinivas Chamarthi srinivas.chamar...@gmail.com wrote:
Can anyone point me to documentation on supported SQL functions? I am trying to do a contains operation on a SQL array type, but I don't know how to write the SQL.

// like the Hive function array_contains
select * from business where array_contains(type, 'insurance')

Appreciate any help.
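If array_contains is only available through the Hive dialect, a rough pyspark sketch would look like this (HiveContext, the input path and the table registration are assumptions; table and column names come from the question):

# Hedged sketch: using HiveContext so Hive UDFs such as array_contains are available.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local", "array-contains-example")
hiveCtx = HiveContext(sc)

business = hiveCtx.jsonFile("hdfs:///data/business.json")   # hypothetical input path
business.registerTempTable("business")

results = hiveCtx.sql("SELECT * FROM business WHERE array_contains(type, 'insurance')")
print results.take(5)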
Re: Why does this simple Spark program use only one core?
You can set the following entry inside the conf/spark-defaults.conf file:

spark.cores.max 16

If you want to read the default value, then you can use the following API call:

sc.defaultParallelism

where sc is your SparkContext object.

Thanks
Best Regards

On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython person.of.b...@gmail.com wrote:
So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following:

spark-submit --master local[*] pi.py

The code of the program is below. When I use top to see CPU consumption, only 1 core is being utilized. Why is that? Secondly, the Spark documentation says that the default parallelism is contained in the property spark.default.parallelism. How can I read this property from within my Python program?

#pi.py
from pyspark import SparkContext
import random

NUM_SAMPLES = 1250

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
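For the second question quoted above, a small sketch of reading the default parallelism and overriding it per RDD (the partition count of 16 is an assumed target):

# Hedged sketch: inspect the default parallelism and override it for one RDD.
from pyspark import SparkContext

sc = SparkContext(appName="parallelism-check")     # let spark-submit supply the master
print "default parallelism:", sc.defaultParallelism

# Force 16 partitions regardless of the default:
rdd = sc.parallelize(xrange(100000), 16)
print "partitions:", rdd.getNumPartitions()
print "count:", rdd.count()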
Re: spark context not defined
If you are talking about a standalone program, have a look at this doc:
https://spark.apache.org/docs/0.9.1/python-programming-guide.html#standalone-programs

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

Thanks
Best Regards

On Sat, Nov 8, 2014 at 4:35 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote:
I'm running the latest version of Spark with Hadoop 1.x, Scala 2.9.3 and Hive 0.9.0. When using Python 2.7:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

I'm getting 'sc not defined'. On the other hand, I can see 'sc' from the pyspark CLI. Is there a way to fix it?
Re: spark-submit inside script... need some bash help
Not sure why that is failing, but I found a workaround like:

#!/bin/bash -e
SPARK_SUBMIT=/home/akhld/mobi/localcluster/spark-1/bin/spark-submit
export _JAVA_OPTIONS=-Xmx1g
OPTS+=" --class org.apache.spark.examples.SparkPi"
echo $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar
exec $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar

Thanks
Best Regards

On Sat, Nov 8, 2014 at 12:31 AM, Koert Kuipers ko...@tresata.com wrote:
I need to run spark-submit inside a script with options that are built up programmatically. Oh, and I need to use exec to keep the same pid (so it can run as a service and be killed). This is what I tried:

==
#!/bin/bash -e
SPARK_SUBMIT=/usr/local/lib/spark/bin/spark-submit
OPTS="--class org.apache.spark.examples.SparkPi"
OPTS+=" --driver-java-options \"-Da=b -Dc=d\""
echo $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
==

No luck. It seems to get confused by the multiple java options. I get:

Exception in thread "main" java.lang.NoClassDefFoundError: -Da=b
Caused by: java.lang.ClassNotFoundException: -Da=b
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: -Da=b. Program will exit.

I also tried many other ways of escaping the quoted java options. None of them work. Strangely, it does work if I replace the last line with (there is no science to this for me, I don't know much about bash, just trying random and probably bad things):

eval exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar

I am lost as to why... and there must be a better solution? It looks kinda nasty with the eval + exec.

best, koert
Re: Why does this simple Spark program use only one core?
Call getNumPartitions() on your RDD to make sure it has the right number of partitions. You can also specify it when doing parallelize, e.g.

rdd = sc.parallelize(xrange(1000), 10)

This should run in parallel if you have multiple partitions and cores, but it might be that during part of the process only one node (e.g. the master process) is doing anything.

Matei

On Nov 9, 2014, at 9:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can set the following entry inside the conf/spark-defaults.conf file:

spark.cores.max 16

If you want to read the default value, then you can use the following API call:

sc.defaultParallelism

where sc is your SparkContext object.

Thanks
Best Regards

On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython person.of.b...@gmail.com wrote:
So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following:

spark-submit --master local[*] pi.py

The code of the program is below. When I use top to see CPU consumption, only 1 core is being utilized. Why is that? Secondly, the Spark documentation says that the default parallelism is contained in the property spark.default.parallelism. How can I read this property from within my Python program?

#pi.py
from pyspark import SparkContext
import random

NUM_SAMPLES = 1250

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Re: netty on classpath when using spark-submit
Hi,

On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote:
On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote:
From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst

Thank you very much, I didn't know about this. Unfortunately, it doesn't change anything. With this setting both true and false (as indicated by the Spark web interface), and no matter whether local[N], yarn-client or yarn-cluster mode is used with spark-submit, the classpath looks the same and the netty class is loaded from the Spark jar.

Can I use this setting with spark-submit at all? Has anyone used this setting successfully, or can anyone advise me on how to use it correctly?

Thanks
Tobias
Re: Re: about write mongodb in mapPartitions
Thanks for your reply! According to your hint, the code should be like this:

// i want to save data in rdd to mongodb and hdfs
rdd.saveAsNewAPIHadoopFile()
rdd.saveAsTextFile()

But will the application read HDFS twice?

qinwei

From: Akhil Das
Date: 2014-11-07 18:32
To: qinwei
CC: user
Subject: Re: about write mongodb in mapPartitions

Why not saveAsNewAPIHadoopFile?

// Define your mongoDB confs
val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

// Write everything to mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any], classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)

Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei wei@dewmobile.net wrote:
Hi, everyone

I have come across a problem with writing data to mongodb in mapPartitions. My code is as below:

val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
// some transformations
val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
val newRDD = rdd.mapPartitions(args => {
    val mongoClient = new MongoClient(host, port)
    val db = mongoClient.getDB("db")
    val coll = db.getCollection("collectionA")
    args.map(arg => {
        coll.insert(new BasicDBObject("pkg", arg))
        arg
    })
    mongoClient.close()
    args
})
newRDD.saveAsTextFile("hdfs://host:port/path")

The application saved data to HDFS correctly, but not to mongodb. Is there something wrong? I know that collecting the newRDD to the driver and then saving it to mongodb will succeed, but will the following saveAsTextFile read the filesystem once again?

Thanks

qinwei
Re: Re: about write mongodb in mapPartitions
Thanks for your reply! As you mentioned, the insert clause is not executed because the results of args.map are never used anywhere. After I modified the code, it works.

qinwei

From: Tobias Pfeiffer
Date: 2014-11-07 18:04
To: qinwei
CC: user
Subject: Re: about write mongodb in mapPartitions

Hi,

On Fri, Nov 7, 2014 at 6:23 PM, qinwei wei@dewmobile.net wrote:
args.map(arg => {
    coll.insert(new BasicDBObject("pkg", arg))
    arg
})
mongoClient.close()
args

As the results of args.map are never used anywhere, I think the loop body is not executed at all. Maybe try:

val argsProcessed = args.map(arg => {
    coll.insert(new BasicDBObject("pkg", arg))
    arg
})
mongoClient.close()
argsProcessed

Tobias
Re: PySpark issue with sortByKey: IndexError: list index out of range
Sorry for the delay. I'll try to add some more details on Monday. Unfortunately, I don't have a script to reproduce the error. Actually, it seemed to be more about the data set than the script: the same code on different data sets led to different results; only larger data sets, on the order of 40 GB, seemed to crash with the described error. Also, I believe our cluster was recently updated to CDH 5.2, which uses Spark 1.1. I'll check to see if the issue was resolved.

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] ml-node+s1001560n18393...@n3.nabble.com wrote:
Could you tell how large the data set is? It will help us to debug this issue.

On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden email] wrote:
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example. On a smaller data set, it worked fine. On a larger data set I got this error:

Traceback (most recent call last):
  File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in <module>
    .sortByKey(lambda x: x)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
    bounds.append(samples[index])
IndexError: list index out of range
Repartition to data-size per partition
I want to avoid the small-files problem when using Spark, without having to manually calibrate a `repartition` at the end of each Spark application I am writing, since the amount of data passing through sadly isn't all that predictable. We're picking up data from and writing it to HDFS. I know other tools like Pig can set the number of reducers, and thus the number of output partitions, for you based on the size of the input data, but I want to know if anyone else has a better way to do this with Spark's primitives. Right now we have an OK solution, but it is starting to break down. We cache our output RDD at the end of the application's flow, then map over it once more to guess what size it will be when pickled and gzipped (we're in pyspark), and then compute a number to repartition to using a target partition size (a rough sketch of this is below). The problem is that we want to work with datasets bigger than what will comfortably fit in the cache. Just spitballing here, but what would be amazing is the ability to ask Spark how big it thinks each partition might be, or the ability to give an accumulator as an argument to `repartition` whose value wouldn't be used until the prior stage had finished, or the ability to just have Spark repartition to a target partition size for us. Thanks for any help you can give me!
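For anyone curious, a rough pyspark sketch of the sampling approach described above (the target size, sample fraction and paths are all illustrative assumptions, and per-record compression only approximates real partition sizes):

# Hedged sketch: estimate bytes-per-record from a sample, then repartition to a target partition size.
import cPickle, zlib
from pyspark import SparkContext

TARGET_BYTES = 128 * 1024 * 1024   # assumed target of compressed output per partition

def compressed_size(record):
    # Approximate on-disk cost of one record when pickled and gzip-compressed.
    return len(zlib.compress(cPickle.dumps(record, -1)))

sc = SparkContext(appName="size-aware-repartition")
rdd = sc.textFile("hdfs:///some/input")                  # hypothetical input

sample = rdd.sample(False, 0.01).map(compressed_size)    # 1% sample, without replacement
estimated_total = sample.sum() * 100                     # scale the 1% sample back up
num_parts = max(1, int(estimated_total / TARGET_BYTES))

rdd.repartition(num_parts).saveAsTextFile("hdfs:///some/output")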
Re: Submitting Spark job on Unix cluster from dev environment (Windows)
Yeah, it works. I turned off the firewall on my Windows machine and it worked. Thanks so much.
Re: org/apache/commons/math3/random/RandomGenerator issue
I set the path of commons-math3-3.1.1.jar in spark.executor.extraClassPath and it worked. Thanks a lot!

It only worked for me when the jar was locally on the machine. Is there a way to make it work when the jar is on HDFS? I tried pointing it at the file on HDFS (with and without hdfs://) and it didn't work.

Thanks,
Lev.
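In case it helps, the same property can also be set from code rather than the command line; a minimal sketch, assuming the jar is already present at the given local path on every node (the path is an example):

# Hedged sketch: adding a local jar to the executor classpath via SparkConf.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("math3-classpath-example")
        .set("spark.executor.extraClassPath", "/opt/jars/commons-math3-3.1.1.jar"))
sc = SparkContext(conf=conf)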
embedded spark for unit testing..
What's the best way to embed Spark to run in local mode in unit tests? Some of our jobs are mildly complex and I want to keep verifying that they work, including during schema changes / migration. I think for some of this I would just run local mode, read from a few text files via resources, and then write to /tmp …

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
Queues
Has anyone implemented Queues using RDDs? Thank You
Rdd replication
Hi, I am trying to understand the RDD replication code. In the process, I frequently execute one Spark application whenever I make a change to the code, to see the effect. My problem is that after a set of repeated executions of the same application, I find that my cluster behaves unusually. Ideally, when I replicate an RDD twice, the web UI displays each partition twice in the RDD storage info tab. But sometimes I find that it displays each partition only once. Also, when it is replicated only once, each partition gets displayed twice. This happens frequently. Can someone throw some light on this?
Re: Does Spark work on multicore systems?
While creating the SparkConf, set the variable spark.cores.max to the maximum number of cores to be used by the Spark job. By default it is set to 1.

-
Lalit Yadav
la...@sigmoidanalytics.com
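A minimal sketch of doing that programmatically (the value 64 matches the machine in the question; note that spark.cores.max mainly applies when running against a standalone or Mesos cluster manager rather than local mode):

# Hedged sketch of setting spark.cores.max on the SparkConf (values are illustrative).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("multicore-test")
        .set("spark.cores.max", "64"))
sc = SparkContext(conf=conf)
print sc.parallelize(range(1000), 64).count()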
Re: embedded spark for unit testing..
You can write a unit test with a local Spark context by mixing in the LocalSparkContext trait. See
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/LocalSparkContext.scala
as an example.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Nov 9, 2014 at 9:12 PM, Kevin Burton bur...@spinn3r.com wrote:
What's the best way to embed Spark to run in local mode in unit tests? Some of our jobs are mildly complex and I want to keep verifying that they work, including during schema changes / migration. I think for some of this I would just run local mode, read from a few text files via resources, and then write to /tmp …

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
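For pyspark users, a rough equivalent using plain unittest rather than the LocalSparkContext trait (all names below are made up; this is just a sketch of creating and stopping a local context around each test):

# Hedged sketch: a local SparkContext created in setUp and stopped in tearDown.
import unittest
from pyspark import SparkContext

class WordCountTest(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local[2]", "unit-test")

    def tearDown(self):
        self.sc.stop()

    def test_count(self):
        rdd = self.sc.parallelize(["a b", "b c"])
        counts = rdd.flatMap(lambda line: line.split()) \
                    .map(lambda w: (w, 1)) \
                    .reduceByKey(lambda a, b: a + b) \
                    .collectAsMap()
        self.assertEqual(counts["b"], 2)

if __name__ == "__main__":
    unittest.main()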
Efficient Key Structure in pairRDD
Hi, we are trying to adopt Spark for our application. We have an analytical application which stores data in star schemas (SQL Server). All the cubes are loaded into a key/value structure and saved in Trove (an in-memory collection). Here the key is a short array where each short number represents a dimension member, e.g. the tuple (CampaignX, Product1, Region_south, 10.23232) gets converted to the Trove key [[12322],[45232],[53421]] with value [10.23232]. This is done to avoid saving a collection of string objects as the key in Trove. Now, can we save this data structure in Spark using a pairRDD? If yes, will key/value be an ideal way of storing data in Spark and retrieving it for data analysis, or is there any other, better data structure we could create that would help us build and process the RDD?

Nitin.
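A hedged sketch of what that might look like as a pair RDD in pyspark, keeping the encoded dimension ids as a tuple key (the ids and measure are the made-up values from the example above):

# Hedged sketch: dimension-member ids as a tuple key, measure as the value.
from pyspark import SparkContext

sc = SparkContext("local", "star-schema-pairs")

# (campaign_id, product_id, region_id) -> measure, mirroring the Trove key/value example
facts = sc.parallelize([
    ((12322, 45232, 53421), 10.23232),
    ((12322, 45232, 53421), 4.5),
    ((12322, 99999, 53421), 7.0),      # 99999 is a made-up second product id
])

# Aggregate per full key, or roll up to one dimension by re-keying:
per_key = facts.reduceByKey(lambda a, b: a + b)
per_region = facts.map(lambda kv: (kv[0][2], kv[1])).reduceByKey(lambda a, b: a + b)

print per_key.collect()
print per_region.collect()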
canopy clustering
I want to run the k-means of MLlib on a big dataset. It seems that for big datasets we need to perform pre-clustering methods such as canopy clustering: by starting with an initial clustering, the number of more expensive distance measurements can be significantly reduced by ignoring points outside of the initial canopies. If I am not mistaken, in the k-means of MLlib there are three initialization modes: k-means++, k-means|| and random. So, can anyone explain whether we can use k-means|| instead of canopy clustering, or do these two methods act completely differently?
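For reference, a rough pyspark sketch of selecting the k-means|| initialization mode in MLlib (the input path, feature parsing and k are placeholder assumptions):

# Hedged sketch: k-means with k-means|| initialization in MLlib.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-parallel-init")

# Each line is assumed to hold space-separated numeric features (hypothetical input file).
data = sc.textFile("hdfs:///data/points.txt") \
         .map(lambda line: [float(x) for x in line.split()])

model = KMeans.train(data, k=10, maxIterations=20, initializationMode="k-means||")
print model.clusterCenters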