Re: how to set database in DataFrame.saveAsTable?
Hi,

You can run sqlContext.sql("use <database name>") before calling dataframe.saveAsTable.

Hope it helps.

Cheers
Gen

On Sun, Feb 21, 2016 at 9:55 AM, Glen wrote:
> For a dataframe in Spark, so that the table can be accessed by Hive.
>
> --
> Jacky Wang
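For illustration, a minimal PySpark sketch of the suggestion above (the database and table names are placeholders, and a HiveContext is assumed):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="save-to-db")
    sqlContext = HiveContext(sc)

    df = sqlContext.table("some_source_table")  # any DataFrame will do

    # Switch the current database first; the saved table then lands in my_db
    # and is visible to Hive.
    sqlContext.sql("use my_db")
    df.saveAsTable("my_table")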
Re: dataframe slows down with Tungsten turned on
Yes, the same code, the same result. In fact, the code has been running for more than a month. Before 1.5.0 the performance was about the same, so I suspect it is caused by Tungsten.

Gen

On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz <rah...@gmail.com> wrote:
> Something to check (just in case):
> Are you getting identical results each time?
>
> On Wed, Nov 4, 2015 at 8:54 AM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi sparkers,
>>
>> I am using dataframes to do some large ETL jobs. More precisely, I create a
>> dataframe from a Hive table, do some operations, and then save it as JSON.
>>
>> When I used spark-1.4.1, the whole process was quite fast, about 1 minute.
>> However, when I run the same code with spark-1.5.1 (with Tungsten turned on),
>> it takes about 2 hours to finish the same job.
>>
>> I checked the details of the tasks; almost all the time is consumed by
>> computation.
>>
>> Any idea why this happens?
>>
>> Thanks a lot in advance for your help.
>>
>> Cheers
>> Gen
dataframe slows down with Tungsten turned on
Hi sparkers,

I am using dataframes to do some large ETL jobs. More precisely, I create a dataframe from a Hive table, do some operations, and then save it as JSON.

When I used spark-1.4.1, the whole process was quite fast, about 1 minute. However, when I run the same code with spark-1.5.1 (with Tungsten turned on), it takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all the time is consumed by computation.

Any idea why this happens?

Thanks a lot in advance for your help.

Cheers
Gen
Re: Spark works with the data in another cluster (Elasticsearch)
Great advice. Thanks a lot, Nick.

In fact, if we call rdd.persist(DISK) at the beginning of the program to avoid hitting the network again and again, the speed is not affected much. In my case, it is just 1 minute more compared with putting the data in local HDFS.

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
While it's true locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster. ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so pulling it across the network is not too bad. If you do need to avoid that, pull the data and write what you need to HDFS as, say, parquet files (e.g. pull data daily and write it, then you have all data available on your Spark cluster). And of course ensure that when you do pull data from ES to Spark, you cache it to avoid hitting the network again.
-- Sent from Mailbox https://www.dropbox.com/mailbox

On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
If the data is local to the machine then obviously it will be faster compared to pulling it through the network and storing it locally (either memory or disk etc). Have a look at data locality: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html
Thanks
Best Regards

On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
Currently I have my data in an Elasticsearch cluster and I am trying to use Spark to analyse it. The Elasticsearch cluster and the Spark cluster are two different clusters, and I use the Hadoop input format (es-hadoop) to read data from ES.
I am wondering how this setup affects the speed of the analysis. If I understand well, Spark will read data from the ES cluster and do the computation on its own cluster (including writing shuffle results on its own machines). Is this right? If so, I think the performance will be just a little slower than if the data were stored on the same cluster.
I would appreciate it if someone could share his/her experience of using Spark with Elasticsearch.
Thanks a lot in advance for your help.
Cheers
Gen
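As an illustration of that advice, a rough PySpark sketch (the ES endpoint, index and output path are placeholders; it assumes the es-hadoop connector jar is on the classpath and that each value comes back as a dict of document fields):

    from pyspark import SparkContext, StorageLevel
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="es-extract")
    sqlContext = SQLContext(sc)

    conf = {
        "es.nodes": "es-cluster-host",        # placeholder ES endpoint
        "es.resource": "my_index/my_type",    # placeholder index/type to read
    }
    es_rdd = sc.newAPIHadoopRDD(
        "org.elasticsearch.hadoop.mr.EsInputFormat",
        "org.apache.hadoop.io.NullWritable",
        "org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf)

    # Persist to disk so repeated actions do not hit the ES cluster again.
    es_rdd.persist(StorageLevel.DISK_ONLY)

    # Or materialise a daily extract on the Spark cluster's HDFS as parquet.
    rows = es_rdd.map(lambda kv: Row(**kv[1]))
    sqlContext.createDataFrame(rows).saveAsParquetFile(
        "hdfs:///data/es_extract/2015-08-25")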
Spark works with the data in another cluster (Elasticsearch)
Hi,

Currently I have my data in an Elasticsearch cluster and I am trying to use Spark to analyse it. The Elasticsearch cluster and the Spark cluster are two different clusters, and I use the Hadoop input format (es-hadoop) to read data from ES.

I am wondering how this setup affects the speed of the analysis. If I understand well, Spark will read data from the ES cluster and do the computation on its own cluster (including writing shuffle results on its own machines). Is this right? If so, I think the performance will be just a little slower than if the data were stored on the same cluster.

I would appreciate it if someone could share his/her experience of using Spark with Elasticsearch.

Thanks a lot in advance for your help.

Cheers
Gen
Re: Questions about Spark SQL joins on non-equality conditions
Hi,

After taking a look at the code, I found the problem: Spark uses BroadcastNestedLoopJoin to handle non-equality conditions, and one of my dataframes (df1) is created from an existing RDD (LogicalRDD), so Spark uses defaultSizeInBytes * length to estimate its size. The other dataframe (df2) is created from a Hive table (about 1G). Therefore Spark thinks df1 is larger than df2, although df1 is actually very small. As a result, Spark tries to collect df2 in order to broadcast it, which causes the error.

Hope this is helpful.

Cheers
Gen

On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
I am sorry to bother you again. When I do a join as follows:
df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b on condition1 *or* condition2")
df.first()
the program fails because the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record cannot possibly be bigger than 1G. When I join on just one condition or on an equality condition, there is no problem.
Could anyone help me, please? Thanks a lot in advance.
Cheers
Gen

On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
I might have a stupid question about Spark SQL's implementation of joins on non-equality conditions, for instance "condition1 or condition2".
In fact, Hive doesn't support such joins, as it is very difficult to express such conditions as a map/reduce job. However, Spark SQL supports this operation, so I would like to know how Spark implements it.
As I observe that such a join runs very slowly, I guess that Spark implements it by doing a filter on top of a cartesian product. Is that true?
Thanks in advance for your help.
Cheers
Gen
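For illustration, a short PySpark sketch of this kind of query (the table names and join conditions are placeholders; sqlContext is assumed to be a HiveContext with tables a and b available). Inspecting the physical plan shows the BroadcastNestedLoopJoin and which side Spark intends to broadcast:

    df = sqlContext.sql("""
        select a.someItem, b.someItem
        from a full outer join b
          on a.id = b.id or a.alt_id = b.alt_id
    """)
    df.explain(True)   # look for BroadcastNestedLoopJoin and the build side
    df.first()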
Re: Questions about Spark SQL joins on non-equality conditions
Hi,

I am sorry to bother you again. When I do a join as follows:

df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b on condition1 *or* condition2")
df.first()

the program fails because the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record cannot possibly be bigger than 1G. When I join on just one condition or on an equality condition, there is no problem.

Could anyone help me, please? Thanks a lot in advance.

Cheers
Gen

On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
I might have a stupid question about Spark SQL's implementation of joins on non-equality conditions, for instance "condition1 or condition2".
In fact, Hive doesn't support such joins, as it is very difficult to express such conditions as a map/reduce job. However, Spark SQL supports this operation, so I would like to know how Spark implements it.
As I observe that such a join runs very slowly, I guess that Spark implements it by doing a filter on top of a cartesian product. Is that true?
Thanks in advance for your help.
Cheers
Gen
Questions about Spark SQL joins on non-equality conditions
Hi,

I might have a stupid question about Spark SQL's implementation of joins on non-equality conditions, for instance "condition1 or condition2".

In fact, Hive doesn't support such joins, as it is very difficult to express such conditions as a map/reduce job. However, Spark SQL supports this operation, so I would like to know how Spark implements it.

As I observe that such a join runs very slowly, I guess that Spark implements it by doing a filter on top of a cartesian product. Is that true?

Thanks in advance for your help.

Cheers
Gen
Re: How to get total CPU consumption for Spark job
Hi,

The Spark UI and logs don't provide this information about the cluster. However, you can use Ganglia to monitor the cluster: in spark-ec2 there is an option to install Ganglia automatically. If you use CDH, you can also use Cloudera Manager.

Cheers
Gen

On Sat, Aug 8, 2015 at 6:06 AM, Xiao JIANG jiangxia...@outlook.com wrote:
Hi all,
I was running some Hive/Spark jobs on a Hadoop cluster. I want to see how Spark helps improve not only the elapsed time but also the total CPU consumption.
For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when the job finishes. But I didn't find any CPU stats for Spark jobs in either the Spark log or the web UI. Is there any place I can find the total CPU consumption for my Spark job? Thanks!
Here is the version info: Spark version 1.3.0, using Scala version 2.10.4, Java 1.7.0_67
Thanks!
Xiao
Re: Problems getting expected results from hbase_inputformat.py
Hi,

In fact, PySpark uses org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/) to transform HBase Result objects into Python strings. Spark updated these two scripts recently, but the changes are not included in the official release of Spark yet, so you are using the new Python script with an old jar. You can clone the latest Spark code from GitHub and build the examples jar yourself; then you will get the correct result.

Cheers
Gen

On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless eric.bl...@yahoo.com.invalid wrote:
I'm having some difficulty getting the desired results from the Spark Python example hbase_inputformat.py. I'm running with CDH5.4, HBase version 1.0.0, Spark v 1.3.0, using Python version 2.6.6.

I followed the example to create a test HBase table. Here's the data from the table I created:

hbase(main):001:0> scan 'dev_wx_test'
ROW     COLUMN+CELL
row1    column=f1:a, timestamp=1438716994027, value=value1
row1    column=f1:b, timestamp=1438717004248, value=value2
row2    column=f1:, timestamp=1438717014529, value=value3
row3    column=f1:, timestamp=1438717022756, value=value4
3 row(s) in 0.2620 seconds

When either of these statements is included -
"hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split(\n))" or
"hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split(\n)).countByValue().items()"
we only get the following printed; (row1, value2) is not printed:
((u'row1', u'value1'), 1)
((u'row2', u'value3'), 1)
((u'row3', u'value4'), 1)

This looks like similar results to the following post I found -
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-td18613.html#a18650
but it appears the pythonconverter HBaseResultToStringConverter has been updated since then. And this problem will be resolved too.

When the statement "hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split(\n)).mapValues(json.loads)" is included, the result is -
ValueError: No JSON object could be decoded

Here is more info on this from the log:

Traceback (most recent call last):
  File hbase_inputformat.py, line 87, in module
    output = hbase_rdd.collect()
  File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py, line 701, in collect
  File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/java_gateway.py, line 538, in __call__
  File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/protocol.py, line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, stluhdpddev27.monsanto.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py, line 101, in main process() File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/serializers.py, line 236, in dump_stream vs = list(itertools.islice(iterator, batch)) File /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py, line 1807, in lambda File /usr/lib64/python2.6/json/__init__.py, line 307, in loads return _default_decoder.decode(s) File /usr/lib64/python2.6/json/decoder.py, line 319, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File /usr/lib64/python2.6/json/decoder.py, line 338, in raw_decode raise ValueError(No JSON object could be decoded) ValueError: No JSON object could be decoded at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:176) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at
Re: Spark MLlib vs SparkR
Hi,

It depends on the problem you are working on. Just as with Python and R, MLlib focuses on machine learning while SparkR will focus on statistics, if SparkR follows the way of R.

For instance, if you want to use glm to analyse data:
1. If you are interested only in the parameters of the model, and want to use the model to predict, then you should use MLlib.
2. If your focus is on the confidence of the model, for example the confidence interval of the result and the significance level of the parameters, you should choose SparkR. However, as there is no glm package for this purpose yet, you would need to code it yourself.

Hope this helps.

Cheers
Gen

On Thu, Aug 6, 2015 at 2:24 AM, praveen S mylogi...@gmail.com wrote:
I was wondering when one should go for MLlib or SparkR. What are the criteria, or what should be considered before choosing either of the solutions for data analysis? Or what are the advantages of Spark MLlib over SparkR, or of SparkR over MLlib?
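As an illustration of case 1 above, a minimal MLlib sketch (the input path and feature layout are assumptions; sc is an existing SparkContext, e.g. from the pyspark shell):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    def parse(line):
        # assumption: CSV rows of the form "label,feature1,feature2,..."
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    data = sc.textFile("hdfs:///data/training.csv").map(parse)
    model = LinearRegressionWithSGD.train(data, iterations=100, step=0.01)

    print model.weights, model.intercept            # the fitted parameters
    preds = data.map(lambda p: (p.label, model.predict(p.features)))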
Re: large scheduler delay in pyspark
Hi,

Thanks a lot for your reply. It seems that it is indeed the slowness of the second version of the code. I rewrote it as list(set([i.items for i in a] + [i.items for i in b])) and the program now behaves normally.

By the way, I noticed that while the computation is running, the UI reports the time as scheduler delay, although it is not really scheduler delay; once the computation finishes, the UI shows the correct scheduler delay.

Cheers
Gen

On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu dav...@databricks.com wrote:
On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote:
> Hi,
> Recently I have met some problems with scheduler delay in PySpark. I have worked several days on this problem without success, so I am asking here for help.
> I have a key-value pair RDD like rdd[(key, list[dict])], and I try to merge the values by adding the two lists. If I do reduceByKey as follows:
> rdd.reduceByKey(lambda a, b: a+b)
> it works fine; the scheduler delay is less than 10s. However, if I do reduceByKey with:
> def f(a, b):
>     for i in b:
>         if i not in a:
>             a.append(i)
>     return a
> rdd.reduceByKey(f)

Is it possible that you have a large object that is also named `i` or `a` or `b`? Btw, the second one could be slower than the first one, because you try to look up an object in a list, which is O(N), especially when the object is large (a dict).

> It causes a very large scheduler delay, about 15-20 minutes. (The data I deal with is about 300 MB, and I use 5 machines with 32GB memory.)

If you see scheduler delay, it means there may be a large broadcast involved.

> I know the second code is not the same as the first. In fact, my purpose is to implement the second, but it does not work, so I tried the first one. I don't know whether this is related to the data (with long strings) or to Spark on YARN. But the first code works fine on the same data.
> Is there any way to find out the logs for when Spark stalls in scheduler delay? Or any ideas about this problem?
> Thanks a lot in advance for your help.
> Cheers
> Gen
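As a sketch of why the set-based rewrite avoids the O(N) membership tests (dicts are not hashable, so each one is turned into a frozenset of its items first; this assumes the dict values themselves are hashable, e.g. strings):

    def merge(a, b):
        # one hash lookup per element instead of a linear scan of `a`
        seen = set(frozenset(d.items()) for d in a + b)
        return [dict(items) for items in seen]

    # `sc` is an existing SparkContext (e.g. the pyspark shell)
    rdd = sc.parallelize([("k", [{"id": "1"}]),
                          ("k", [{"id": "1"}, {"id": "2"}])])
    merged = rdd.reduceByKey(merge)
    # merged.collect() -> [("k", [{"id": "1"}, {"id": "2"}])]  (order may vary)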
large scheduler delay in pyspark
Hi,

Recently I have met some problems with scheduler delay in PySpark. I have worked several days on this problem without success, so I am asking here for help.

I have a key-value pair RDD like rdd[(key, list[dict])], and I try to merge the values by adding the two lists. If I do reduceByKey as follows:

rdd.reduceByKey(lambda a, b: a+b)

it works fine; the scheduler delay is less than 10s. However, if I do reduceByKey with:

def f(a, b):
    for i in b:
        if i not in a:
            a.append(i)
    return a

rdd.reduceByKey(f)

it causes a very large scheduler delay, about 15-20 minutes. (The data I deal with is about 300 MB, and I use 5 machines with 32GB memory.)

I know the second code is not the same as the first. In fact, my purpose is to implement the second, but it does not work, so I tried the first one. I don't know whether this is related to the data (with long strings) or to Spark on YARN. But the first code works fine on the same data.

Is there any way to find out the logs for when Spark stalls in scheduler delay? Or any ideas about this problem?

Thanks a lot in advance for your help.

Cheers
Gen
Strange behavior of pyspark with --jars option
Hi,

I have met an interesting problem with the --jars option. As I use a third-party dependency, elasticsearch-spark, I pass the jar with the following command:

./bin/spark-submit --jars path-to-dependencies ...

It works well. However, if I use HiveContext.sql, Spark loses the dependencies that I passed. It seems that the execution of HiveContext overrides the configuration (but if we check sparkContext._conf, the configuration is unchanged).

If I pass the dependencies with --driver-class-path and spark.executor.extraClassPath instead, the problem disappears.

Does anyone know why this interesting problem happens?

Thanks a lot for your help in advance.

Cheers
Gen
Re: pyspark hbase range scan
Hi,

Maybe this might be helpful:
https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala

Cheers
Gen

On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com wrote:
I am attempting to read an hbase table in pyspark with a range scan.

conf = {
    hbase.zookeeper.quorum: host,
    hbase.mapreduce.inputtable: table,
    hbase.mapreduce.scan: scan
}
hbase_rdd = sc.newAPIHadoopRDD(
    org.apache.hadoop.hbase.mapreduce.TableInputFormat,
    org.apache.hadoop.hbase.io.ImmutableBytesWritable,
    org.apache.hadoop.hbase.client.Result,
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)

If I jump over to scala or java and generate a base64 encoded protobuf scan object and convert it to a string, I can use that value for hbase.mapreduce.scan and everything works: the rdd will correctly perform the range scan and I am happy. The problem is that I can not find any reasonable way to generate that range scan string in python. The scala code required is:

import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put, Result => HBaseResult, Scan}

val scan = new Scan()
scan.setStartRow("test_domain\0email".getBytes)
scan.setStopRow("test_domain\0email~".getBytes)
def scanToString(scan: Scan): String = {
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray())
}
scanToString(scan)

Is there another way to perform an hbase range scan from pyspark or is that functionality something that might be supported in the future?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
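For the Python side, once such a base64-encoded Scan string has been produced (for instance with the scanToString helper quoted above), a hedged sketch of how it can be passed in (the converter class names follow the hbase_inputformat.py example; the host, table and scan string are placeholders):

    conf = {
        "hbase.zookeeper.quorum": "zk-host",
        "hbase.mapreduce.inputtable": "my_table",
        "hbase.mapreduce.scan": "<base64-encoded protobuf Scan string>",
    }
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf)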
Re: Apache Spark ALS recommendations approach
Hi,

If you do a cartesian join to predict users' preferences over all the products, I don't think that 8 nodes with 64GB of RAM would be enough for the data. Recently I used ALS in a similar situation, with just 10M users and 0.1M products, and the minimum requirement was 9 nodes with 10GB of RAM. Moreover, even if the job gets through, the processing time will be very long.

Maybe you should try to reduce the set of products to predict for each client; in practice, you never need to predict the preference over all products to make a recommendation.

Hope this is helpful.

Cheers
Gen

On Wed, Mar 18, 2015 at 12:13 PM, Aram Mkrtchyan aram.mkrtchyan...@gmail.com wrote:
Trying to build a recommendation system using Spark MLlib's ALS. Currently, we're trying to pre-build recommendations for all users on a daily basis. We're using simple implicit feedback and ALS.
The problem is, we have 20M users and 30M products, and to call the main predict() method, we need to have the cartesian join of users and products, which is too huge, and it may take days to generate only the join. Is there a way to avoid the cartesian join to make the process faster?
Currently we have 8 nodes with 64GB of RAM, I think it should be enough for the data.

val users: RDD[Int] = ???            // RDD with 20M userIds
val products: RDD[Int] = ???         // RDD with 30M productIds
val ratings: RDD[Rating] = ???       // RDD with all user-product feedbacks

val model = new ALS().setRank(10).setIterations(10)
  .setLambda(0.0001).setImplicitPrefs(true)
  .setAlpha(40).run(ratings)

val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)
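As a hedged PySpark sketch of the "reduce the candidate set per user" idea (toy data only; sc is an existing SparkContext, and the real candidate list would come from business rules or item filters rather than the full cartesian product):

    from pyspark.mllib.recommendation import ALS, Rating

    # toy implicit-feedback data; real input would come from the feedback logs
    ratings = sc.parallelize([Rating(1, 10, 3.0), Rating(1, 20, 1.0),
                              Rating(2, 10, 2.0), Rating(2, 30, 5.0)])
    model = ALS.trainImplicit(ratings, rank=10, iterations=10,
                              lambda_=0.0001, alpha=40.0)

    # score only a small per-user candidate list instead of users x products
    candidates = sc.parallelize([(1, 30), (2, 20)])   # (userId, productId)
    scores = model.predictAll(candidates)
    top = (scores.map(lambda r: (r.user, (r.rating, r.product)))
                 .groupByKey()
                 .mapValues(lambda xs: sorted(xs, reverse=True)[:5]))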
Re: Does anyone integrate HBASE on Spark
Hi,

There are some examples in spark/examples (https://github.com/apache/spark/tree/master/examples) and there are also some examples among the Spark packages (http://spark-packages.org/). I also find this blog quite good: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

Hope this is helpful.

Cheers
Gen

On Wed, Mar 4, 2015 at 6:51 PM, sandeep vura sandeepv...@gmail.com wrote:
Hi Sparkers,
How do I integrate HBase with Spark? Appreciate any replies!
Regards,
Sandeep.v
Re: Spark on EC2
Hi,

I am sorry that I made a mistake about AWS pricing. You can read Sean Owen's email, which explains the strategies for running Spark on AWS better.

For your question: it means that you just download Spark and unzip it, then run the Spark shell with ./bin/spark-shell or ./bin/pyspark. It is useful for getting familiar with Spark, and you can do this on your laptop as well as on EC2. Running ./ec2/spark-ec2, on the other hand, launches Spark in standalone mode on a cluster; you can find more details here: https://spark.apache.org/docs/latest/spark-standalone.html

Cheers
Gen

On Tue, Feb 24, 2015 at 4:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote:
Kindly bear with my questions as I am new to this. "If you run spark on local mode on a ec2 machine": what does this mean? Is it that I launch a Spark cluster from my local machine, i.e., by running the shell script that is there in /spark/ec2?

On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
As a real Spark cluster needs at least one master and one slave, you need to launch two machines, so the second machine is not free. However, if you run Spark in local mode on a single EC2 machine, it is free.
The charge from AWS depends on how many machines you launch and their types, not on how much you use them.
Hope this helps.
Cheers
Gen

On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote:
Hi,
I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this?
Thank You
Re: Spark on EC2
Hi,

As a real Spark cluster needs at least one master and one slave, you need to launch two machines, so the second machine is not free. However, if you run Spark in local mode on a single EC2 machine, it is free.

The charge from AWS depends on how many machines you launch and their types, not on how much you use them.

Hope this helps.

Cheers
Gen

On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote:
Hi,
I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this?
Thank You
Re: Specifying AMI when using Spark EC-2 scripts
Hi,

You can use -a or --ami <your ami id> to launch the cluster with a specific AMI. If I remember well, the default system is Amazon Linux.

Hope this helps.

Cheers
Gen

On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote:
Hi there,
Is there a way to specify the AWS AMI with a particular OS (say Ubuntu) when launching Spark on the Amazon cloud with the provided scripts? What is the default AMI/operating system that is launched by the EC2 script?
Thanks
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-AMI-when-using-Spark-EC-2-scripts-tp21658.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Loading JSON dataset with Spark MLlib
Hi,

In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one JSON object per line as a SchemaRDD. Or you can use sc.textFile() to load the text file into an RDD and then use sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a SchemaRDD.

Hope this helps.

Cheers
Gen

On Mon, Feb 16, 2015 at 12:39 AM, pankaj channe pankajc...@gmail.com wrote:
Hi,
I am new to spark and planning on writing a machine learning application with Spark MLlib. My dataset is in JSON format. Is it possible to load data into Spark without using any external JSON libraries? I have explored the option of SparkSQL but I believe that is only for interactive use or loading data into Hive tables.
Thanks,
Pankaj
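For illustration, a small sketch of both approaches (the path is a placeholder and each line of the file is expected to hold one JSON object):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="load-json")
    sqlCtx = SQLContext(sc)

    # 1) load a text file of one-JSON-object-per-line directly
    dataset = sqlCtx.jsonFile("hdfs:///data/dataset.json")

    # 2) or go through an RDD of JSON strings
    raw = sc.textFile("hdfs:///data/dataset.json")
    dataset2 = sqlCtx.jsonRDD(raw)

    dataset.printSchema()
    dataset.registerTempTable("dataset")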
Re: Installing a python library along with ec2 cluster
Hi, Please take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html Cheers Gen On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi I am very new both in spark and aws stuff.. Say, I want to install pandas on ec2.. (pip install pandas) How do I create the image and the above library which would be used from pyspark. Thanks On Sun, Feb 8, 2015 at 3:03 AM, gen tang gen.tan...@gmail.com wrote: Hi, You can make a image of ec2 with all the python libraries installed and create a bash script to export python_path in the /etc/init.d/ directory. Then you can launch the cluster with this image and ec2.py Hope this can be helpful Cheers Gen On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I want to install couple of python libraries (pip install python_library) which I want to use on pyspark cluster which are developed using the ec2 scripts. Is there a way to specify these libraries when I am building those ec2 clusters? Whats the best way to install these libraries on each ec2 node? Thanks
Re: Installing a python library along with ec2 cluster
Hi,

You can make an image of an EC2 instance with all the Python libraries installed, and create a bash script in the /etc/init.d/ directory to export the PYTHONPATH. Then you can launch the cluster with this image and ec2.py.

Hope this can be helpful.

Cheers
Gen

On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I want to install a couple of python libraries (pip install python_library) which I want to use on a pyspark cluster built using the ec2 scripts. Is there a way to specify these libraries when I am building those ec2 clusters? What's the best way to install these libraries on each ec2 node?
Thanks
Re: no space left at worker node
Hi,

In fact, I have met this problem before; it is a bug on the AWS side. Which type of machine do you use? If I guess correctly, you can check the file /etc/fstab: there may be a double mount of /dev/xvdb. If so, you should:
1. stop hdfs
2. umount /dev/xvdb at /
3. restart hdfs

Hope this could be helpful.

Cheers
Gen

On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:
Hi,
I submitted a spark job to an ec2 cluster, using spark-submit. At a worker node, there is an exception of 'no space left on device' as follows.

==========================================
15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file /root/spark/work/app-20150208014557-0003/0/stdout
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
==========================================

The command df showed the following information at the worker node:

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/xvda1       8256920 8256456         0 100% /
tmpfs            7752012       0   7752012   0% /dev/shm
/dev/xvdb       30963708 1729652  27661192   6% /mnt

Does anybody know how to fix this? Thanks.
Ey-Chih Chow
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: no space left at worker node
Hi,

In fact, /dev/sdb is /dev/xvdb, so it seems there is no problem with a double mount. However, there is no information about /mnt2; you should check whether /dev/sdc is properly mounted or not.

Michael's reply is a good solution for this type of problem; you can check his site.

Cheers
Gen

On Sun, Feb 8, 2015 at 5:53 PM, ey-chih chow eyc...@hotmail.com wrote:
Gen,
Thanks for your information. The content of /etc/fstab at the worker node (r3.large) is:

#
LABEL=/     /          ext4    defaults,noatime  1 1
tmpfs       /dev/shm   tmpfs   defaults  0 0
devpts      /dev/pts   devpts  gid=5,mode=620  0 0
sysfs       /sys       sysfs   defaults  0 0
proc        /proc      proc    defaults  0 0
/dev/sdb    /mnt       auto    defaults,noatime,nodiratime,comment=cloudconfig  0 0
/dev/sdc    /mnt2      auto    defaults,noatime,nodiratime,comment=cloudconfig  0 0

There is no entry for /dev/xvdb.
Ey-Chih Chow

------------------------------
Date: Sun, 8 Feb 2015 12:09:37 +0100
Subject: Re: no space left at worker node
From: gen.tan...@gmail.com
To: eyc...@hotmail.com
CC: user@spark.apache.org

Hi, In fact, I have met this problem before; it is a bug on the AWS side. Which type of machine do you use? If I guess correctly, you can check the file /etc/fstab. There may be a double mount of /dev/xvdb. If so, you should 1. stop hdfs 2. umount /dev/xvdb at / 3. restart hdfs Hope this could be helpful. Cheers Gen

On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote: Hi, I submitted a spark job to an ec2 cluster, using spark-submit. At a worker node, there is an exception of 'no space left on device' as follows. == 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file /root/spark/work/app-20150208014557-0003/0/stdout java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:345) at org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) === The command df showed the following information at the worker node: Filesystem 1K-blocks Used Available Use% Mounted on /dev/xvda1 8256920 8256456 0 100% / tmpfs 7752012 0 7752012 0% /dev/shm /dev/xvdb 30963708 1729652 27661192 6% /mnt Does anybody know how to fix this? Thanks. Ey-Chih Chow -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: no space left at worker node
Hi,

I am sorry, I made a mistake: r3.large has only one SSD, which is mounted at /mnt, so there is no /dev/sdc.

In fact, the problem is that there is no space left under the / directory. You should check whether your application writes data under this directory (for instance, saving files with file:///). If not, you can use watch du -sh during the run to figure out which directory is expanding. Normally only the /mnt directory, which is backed by the SSD, grows significantly, because the HDFS data is stored there. Once you find the directory that causes the "no space left" problem, you can find the specific reason.

Cheers
Gen

On Sun, Feb 8, 2015 at 10:45 PM, ey-chih chow eyc...@hotmail.com wrote:
Thanks Gen. How can I check if /dev/sdc is well mounted or not? In general, the problem shows up when I submit the second or third job. The first job I submit will most likely succeed.
Ey-Chih Chow

------------------------------
Date: Sun, 8 Feb 2015 18:18:03 +0100
Subject: Re: no space left at worker node
From: gen.tan...@gmail.com
To: eyc...@hotmail.com
CC: user@spark.apache.org

Hi, In fact, /dev/sdb is /dev/xvdb, so it seems there is no problem with a double mount. However, there is no information about /mnt2; you should check whether /dev/sdc is properly mounted or not. Michael's reply is a good solution for this type of problem; you can check his site. Cheers Gen

On Sun, Feb 8, 2015 at 5:53 PM, ey-chih chow eyc...@hotmail.com wrote: Gen, Thanks for your information. The content of /etc/fstab at the worker node (r3.large) is: # LABEL=/ / ext4 defaults,noatime 1 1 tmpfs /dev/shm tmpfs defaults 0 0 devpts /dev/pts devpts gid=5,mode=620 0 0 sysfs /sys sysfs defaults 0 0 proc /proc proc defaults 0 0 /dev/sdb /mnt auto defaults,noatime,nodiratime,comment=cloudconfig 0 0 /dev/sdc /mnt2 auto defaults,noatime,nodiratime,comment=cloudconfig 0 0 There is no entry for /dev/xvdb. Ey-Chih Chow

------------------------------
Date: Sun, 8 Feb 2015 12:09:37 +0100
Subject: Re: no space left at worker node
From: gen.tan...@gmail.com
To: eyc...@hotmail.com
CC: user@spark.apache.org

Hi, In fact, I have met this problem before; it is a bug on the AWS side. Which type of machine do you use? If I guess correctly, you can check the file /etc/fstab. There may be a double mount of /dev/xvdb. If so, you should 1. stop hdfs 2. umount /dev/xvdb at / 3. restart hdfs Hope this could be helpful. Cheers Gen

On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote: Hi, I submitted a spark job to an ec2 cluster, using spark-submit. At a worker node, there is an exception of 'no space left on device' as follows.
== 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file /root/spark/work/app-20150208014557-0003/0/stdout java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:345) at org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) === The command df showed the following information at the worker node: Filesystem 1K-blocks Used Available Use% Mounted on /dev/xvda1 8256920 8256456 0 100% / tmpfs 7752012 0 7752012 0% /dev/shm /dev/xvdb 30963708 1729652 27661192 6% /mnt Does anybody know how to fix this? Thanks. Ey-Chih Chow -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Pyspark Hbase scan.
Hi, In fact, this pull https://github.com/apache/spark/pull/3920 is to do Hbase scan. However, it is not merged yet. You can also take a look at the example code at http://spark-packages.org/package/20 which is using scala and python to read data from hbase. Hope this can be helpful. Cheers Gen On Thu, Feb 5, 2015 at 11:11 AM, Castberg, René Christian rene.castb...@dnvgl.com wrote: Hi, I am trying to do a hbase scan and read it into a spark rdd using pyspark. I have successfully written data to hbase from pyspark, and been able to read a full table from hbase using the python example code. Unfortunately I am unable to find any example code for doing an HBase scan and read it into a spark rdd from pyspark. I have found a scala example : http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark But i can't find anything on how to do this from python. Can anybody shed some light on how (and if) this can be done? Regards Rene Castberg ** This e-mail and any attachments thereto may contain confidential information and/or information protected by intellectual property rights for the exclusive attention of the intended addressees named above. If you have received this transmission in error, please immediately notify the sender by return e-mail and delete this message and its attachments. Unauthorized use, copying or further full or partial distribution of this e-mail or its contents is prohibited. **
Re: Fail to launch spark-shell on windows 2008 R2
Hi,

Using Spark under Windows is a really bad idea, because even if you solve the problems with Hadoop, you will probably run into java.net.SocketException: connection reset by peer. It is caused by the fact that we ask for socket ports too frequently under Windows, and in my experience it is really difficult to solve. You will also find something really odd: the same code sometimes works and sometimes doesn't, even in shell mode.

I am sorry, but I don't see the point of running Spark under Windows, and moreover with the local file system, in a business environment. Do you have a cluster running Windows?

FYI, I have used Spark prebuilt for Hadoop 1 under Windows 7 and there was no problem launching it, only the java.net.SocketException problem. If you are using Spark prebuilt for Hadoop 2, you should consider following the solution provided in https://issues.apache.org/jira/browse/SPARK-2356

Cheers
Gen

On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote:
Install virtual box which runs Linux? That does not help us. We have business reasons to run it on the Windows operating system, e.g. Windows 2008 R2. If anybody has done that, please give some advice on what version of spark, which version of Hadoop you built spark against, etc. Note that we only use the local file system and do not have any hdfs file system at all. I don't understand why spark generates so many errors on Hadoop while we don't even need hdfs.
Ningjun

*From:* gen tang [mailto:gen.tan...@gmail.com]
*Sent:* Thursday, January 29, 2015 10:45 AM
*To:* Wang, Ningjun (LNG-NPV)
*Cc:* user@spark.apache.org
*Subject:* Re: Fail to launch spark-shell on windows 2008 R2

Hi, I tried to use spark under windows once. However the only solution that I found is to install virtualbox. Hope this can help you. Best Gen

On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I deployed spark-1.1.0 on Windows 7 and was able to launch the spark-shell.
I then deploy it to windows 2008 R2 and launch the spark-shell, I got the error java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOExceptio n: Cannot run program ls: CreateProcess error=2, The system cannot find the file specified at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:200) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710) at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil eSystem.java:443) at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst em.java:418) Here is the detail output C:\spark-1.1.0\bin spark-shell 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to: ningjun.wang, 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to: ningjun.wang, 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ningjun.wang, ); users with modify permissions: Set(ningjun.wang, ) 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT 15/01/29 10:13:14 INFO AbstractConnector: Started SocketConnector@0.0.0.0:53692 15/01/29 10:13:14 INFO Utils: Successfully started service 'HTTP class server' on port 53692. Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not initialize class org.f usesource.jansi.internal.Kernel32 Falling back to SimpleReader. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.1.0 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) Type in expressions to have them evaluated. Type :help for more information. 15/01/29 10:13:18 INFO SecurityManager: Changing view acls to: ningjun.wang, 15/01/29 10:13:18 INFO SecurityManager: Changing modify acls to: ningjun.wang, 15/01/29 10:13:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ningjun.wang, ); users with modify permissions: Set(ningjun.wang, ) 15/01/29 10:13:18 INFO Slf4jLogger: Slf4jLogger started 15/01/29 10:13:18 INFO Remoting: Starting remoting 15/01/29 10:13:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@L AB4-WIN01.pcc.lexisnexis.com:53705] 15/01/29 10:13:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp
Re: Fail to launch spark-shell on windows 2008 R2
Hi, I tried to use spark under windows once. However the only solution that I found is to install virtualbox Hope this can help you. Best Gen On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I deployed spark-1.1.0 on Windows 7 and was albe to launch the spark-shell. I then deploy it to windows 2008 R2 and launch the spark-shell, I got the error java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOExceptio n: Cannot run program ls: CreateProcess error=2, The system cannot find the file specified at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:200) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.util.Shell.execCommand(Shell.java:461) at org.apache.hadoop.util.Shell.execCommand(Shell.java:444) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710) at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil eSystem.java:443) at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst em.java:418) Here is the detail output C:\spark-1.1.0\bin spark-shell 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to: ningjun.wang, 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to: ningjun.wang, 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ningjun.wang, ); users with modify permissions: Set(ningjun.wang, ) 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT 15/01/29 10:13:14 INFO AbstractConnector: Started SocketConnector@0.0.0.0:53692 15/01/29 10:13:14 INFO Utils: Successfully started service 'HTTP class server' on port 53692. Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not initialize class org.f usesource.jansi.internal.Kernel32 Falling back to SimpleReader. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.1.0 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) Type in expressions to have them evaluated. Type :help for more information. 15/01/29 10:13:18 INFO SecurityManager: Changing view acls to: ningjun.wang, 15/01/29 10:13:18 INFO SecurityManager: Changing modify acls to: ningjun.wang, 15/01/29 10:13:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ningjun.wang, ); users with modify permissions: Set(ningjun.wang, ) 15/01/29 10:13:18 INFO Slf4jLogger: Slf4jLogger started 15/01/29 10:13:18 INFO Remoting: Starting remoting 15/01/29 10:13:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@L AB4-WIN01.pcc.lexisnexis.com:53705] 15/01/29 10:13:19 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@LAB4-WIN 01.pcc.lexisnexis.com:53705] 15/01/29 10:13:19 INFO Utils: Successfully started service 'sparkDriver' on port 53705. 15/01/29 10:13:19 INFO SparkEnv: Registering MapOutputTracker 15/01/29 10:13:19 INFO SparkEnv: Registering BlockManagerMaster 15/01/29 10:13:19 INFO DiskBlockManager: Created local directory at C:\Users\NINGJU~1.WAN\AppData\Lo cal\Temp\3\spark-local-20150129101319-f9da 15/01/29 10:13:19 INFO Utils: Successfully started service 'Connection manager for block manager' on port 53708. 
15/01/29 10:13:19 INFO ConnectionManager: Bound socket to port 53708 with id = ConnectionManagerId(L AB4-WIN01.pcc.lexisnexis.com,53708) 15/01/29 10:13:19 INFO MemoryStore: MemoryStore started with capacity 265.4 MB 15/01/29 10:13:19 INFO BlockManagerMaster: Trying to register BlockManager 15/01/29 10:13:19 INFO BlockManagerMasterActor: Registering block manager LAB4-WIN01.pcc.lexisnexis. com:53708 with 265.4 MB RAM 15/01/29 10:13:19 INFO BlockManagerMaster: Registered BlockManager 15/01/29 10:13:19 INFO HttpFileServer: HTTP File server directory is C:\Users\NINGJU~1.WAN\AppData\L ocal\Temp\3\spark-2f65b1c3-00e2-489b-967c-4e1f41520583 15/01/29 10:13:19 INFO HttpServer: Starting HTTP Server 15/01/29 10:13:19 INFO Server: jetty-8.y.z-SNAPSHOT 15/01/29 10:13:19 INFO AbstractConnector: Started SocketConnector@0.0.0.0:53709 15/01/29 10:13:19 INFO Utils: Successfully started service 'HTTP file server' on port 53709. 15/01/29 10:13:20 INFO Server: jetty-8.y.z-SNAPSHOT 15/01/29 10:13:20 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 15/01/29 10:13:20 INFO Utils: Successfully started service 'SparkUI' on
[documentation] Update the Python ALS example on the site?
Hi,

In Spark 1.2.0, ALS requires the ratings to be an RDD of Rating, or of tuples or lists. However, the current example on the site still uses an RDD of arrays as the ratings, so the example doesn't work under version 1.2.0. Maybe we should update the documentation on the site?

Thanks a lot.

Cheers
Gen
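For reference, a sketch of what the updated example would look like (assuming the sample file shipped with Spark and an existing SparkContext sc):

    from pyspark.mllib.recommendation import ALS, Rating

    data = sc.textFile("data/mllib/als/test.data")   # lines of "user,product,rating"
    ratings = data.map(lambda l: l.split(',')) \
                  .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))
    model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)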
Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script
Hi,

This is because "ssh-ready" in the ec2 script means that all the instances are in the "running" state and all their status checks are OK; in other words, the instances are ready to download and install software, just as EMR is ready for bootstrap actions. Before, the script repeatedly printed messages showing that it was waiting for every instance to launch, which was quite ugly, so they changed the message that is printed.

However, you can ssh to an instance even while it is still in the "pending" status. If you wait a little more patiently, the script will finish launching the cluster.

Cheers
Gen

On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com wrote:
Originally posted here: http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script

I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k key-pair -i identity-file.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b___6
Launched master in us-west-2c, regid = r-0__0
Waiting for all instances in cluster to enter 'ssh-ready' state..

Yet I can SSH into these instances without complaint:

ubuntu@machine:~$ ssh -i identity-file.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB--DDD.eee1.ff.provider.net
__| __|_ ) _| ( / Amazon Linux AMI ___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 59 security update(s) out of 257 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
root@ip-internal ~]$

I'm trying to figure out if this is a problem in AWS or with the Spark scripts. I've never had this issue before until recently.

--
Nathan Murthy // 713.884.7110 (mobile) // @natemurthy
Re: Did anyone try overcommitting CPU cores?
Hi,

As you said, --executor-cores defines the maximum number of tasks that an executor can run simultaneously, so if you claim 10 cores, it is not possible to launch more than 10 tasks in an executor at the same time.

In my experience, setting more cores than the number of physical CPU cores causes CPU overload at some point during the execution of a Spark application, especially when you are using algorithms from the mllib package. In addition, executor-cores affects the default level of parallelism of Spark. Therefore, I recommend setting cores = physical cores by default.

Moreover, I don't think overcommitting the CPU will increase CPU utilisation; in my opinion, it just lengthens the CPU's waiting queue. If you observe that the CPU load is very low (through Ganglia, for example) and there is too much IO, then increasing the level of parallelism or serializing your objects may be a good choice.

Hope this helps.

Cheers
Gen

On Fri, Jan 9, 2015 at 10:12 AM, Xuelin Cao xuelincao2...@gmail.com wrote:
Thanks, but how do I increase the tasks per core? For example, if the application claims 10 cores, is it possible to launch 100 tasks concurrently?

On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke jornfra...@gmail.com wrote:
Hello,
Based on experiences with other software in virtualized environments I cannot really recommend this. However, I am not sure how Spark reacts. You may face unpredictable task failures depending on utilization, and tasks connecting to external systems (databases etc.) may fail unexpectedly, which might be a problem for them (transactions not finishing etc.).
Why not increase the tasks per core?
Best regards

On 9 Jan 2015 at 06:46, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi,
I'm wondering whether it is a good idea to overcommit CPU cores on the Spark cluster. For example, in our testing cluster, each worker machine has 24 physical CPU cores. However, we are allowed to set the CPU core number to 48 or more in the Spark configuration file. As a result, we are allowed to launch more tasks than the number of physical CPU cores.
The motivation for overcommitting CPU cores is that, very often, a task cannot consume 100% of a single CPU core's resources (due to I/O, shuffle, etc.), so overcommitting the CPU cores allows more tasks to run at the same time and makes the resources be used more economically.
But is there any reason we should not do this? Has anyone tried it?
[image: Inline image 1]
Re: Spark on Teradata?
Thanks a lot for your reply.

In fact, I need to work on almost all the data in Teradata (~100T), so I don't think JdbcRDD is a good choice.

Cheers
Gen

On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote:
Depending on your use case: if the use case is to extract a small amount of data out of Teradata, then you can use the JdbcRDD and soon a JDBC input source based on the new Spark SQL external data source API.

On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
I have a stupid question: is it possible to use Spark on a Teradata data warehouse? I read some news on the internet that say yes, but I didn't find any examples of this.
Thanks in advance.
Cheers
Gen
Spark on Teradata?
Hi,

I have a stupid question: is it possible to use Spark on a Teradata data warehouse? I read some news on the internet that say yes, but I didn't find any examples of this.

Thanks in advance.

Cheers
Gen
Re: Spark Trainings/ Professional certifications
Hi, I am sorry to bother you, but I couldn't find any information about online test of spark certification managed through Kryterion. Could you please give me the link about it? Thanks a lot in advance. Cheers Gen On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote: Hi Saurabh, In your area, Big Data Partnership provides Spark training: http://www.bigdatapartnership.com/ As Sean mentioned, there is a certification program via a partnership between O'Reilly Media and Databricks http://www.oreilly.com/go/sparkcert That is offered in two ways, in-person at events such as Strata + Hadoop World http://strataconf.com/ and also an online test managed through Kryterion. Sign up on the O'Reilly page. There are also two MOOCs starting soon on edX through University of California: Intro to Big Data with Apache Spark by Prof. Anthony Joseph, UC Berkeley begins Feb 23 https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x#.VK1pkmTF8gk Scalable Machine Learning Prof. Ameet Talwalkar, UCLA begins Apr 14 https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#.VK1pkmTF8gk For coaching, arguably you might be best to talk with consultants especially for near-term needs. Contact me off-list and I can help provide intros in your area. Thanks, Paco On Wed, Jan 7, 2015 at 6:38 AM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Hi, Can you please suggest some of the best available trainings/ coaching and professional certifications in Apache Spark? We are trying to run predictive analysis on our Sales data and come out with recommendations (leads). We have tried to run CF but we end up getting absolutely bogus results!! A training that would leave us hands on to do our job effectively is what we are after. In addition to this, if this training could provide a firm ground for a professional certification, that would be an added advantage. Thanks for your inputs Regards, Saurabh Agrawal -- This e-mail, including accompanying communications and attachments, is strictly confidential and only for the intended recipient. Any retention, use or disclosure not expressly authorised by Markit is prohibited. This email is subject to all waivers and other terms at the following link: http://www.markit.com/en/about/legal/email-disclaimer.page Please visit http://www.markit.com/en/about/contact/contact-us.page? for contact information on our offices worldwide. MarkitSERV Limited has its registered office located at Level 4, Ropemaker Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and regulated by the Financial Conduct Authority with registration number 207294
Re: Using ec2 launch script with locally built version of spark?
Hi,

The ec2 launch script provided by Spark uses https://github.com/mesos/spark-ec2 to download and configure all the tools in the cluster (Spark, Hadoop, etc.), so you can create your own git repository to achieve your goal. More precisely:
1. Upload your own version of Spark to S3 at <path to your spark>.
2. Fork https://github.com/mesos/spark-ec2 and make a change in ./spark/init.sh (add wget <path to your spark>).
3. Change line 638 in the ec2 launch script so that it git clones your repository on GitHub.

Hope this can be helpful.

Cheers
Gen

On Tue, Jan 6, 2015 at 11:51 PM, Ganon Pierce ganon.pie...@me.com wrote:
Is there a way to use the ec2 launch script with a locally built version of spark? I launch and destroy clusters pretty frequently and would like to not have to wait each time for the master instance to compile the source, as happens when I set the -v tag with the latest git commit. To be clear, I would like to launch a non-release version of spark compiled locally as quickly as I can launch a release version (e.g. -v 1.2.0) which does not have to be compiled upon launch.
Up to this point, I have just used the launch script included with the latest release to set up the cluster and then manually replaced the assembly file on the master and slaves with the version I built locally and then stored on s3. Is there anything wrong with doing it this way? Further, is there a better or more standard way of accomplishing this?
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org