Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread gen tang
Hi, You can call sqlContext.sql("use <database>") before calling dataframe.saveAsTable. Hope it could be helpful. Cheers Gen On Sun, Feb 21, 2016 at 9:55 AM, Glen wrote: > For a dataframe in Spark, so that the table can be visited by Hive. > > -- > Jacky Wang >
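
For reference, a minimal pyspark sketch of this pattern; the database and table names below are placeholders:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="save-to-db")
    sqlContext = HiveContext(sc)
    df = sqlContext.table("some_source_table")  # hypothetical source table
    sqlContext.sql("use my_database")           # switch the current database first
    df.saveAsTable("my_table")                  # lands as my_database.my_table, visible to Hive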

Re: dataframe slows down with tungsten turned on

2015-11-04 Thread gen tang
t in case): > Are you getting identical results each time? > > On Wed, Nov 4, 2015 at 8:54 AM, gen tang <gen.tan...@gmail.com> wrote: > >> Hi sparkers, >> >> I am using dataframe to do some large ETL jobs. >> More precisely, I create dataframe from HI

dataframe slows down with tungsten turned on

2015-11-03 Thread gen tang
Hi sparkers, I am using dataframes to do some large ETL jobs. More precisely, I create a dataframe from a Hive table and do some operations. And then I save it as json. When I used spark-1.4.1, the whole process was quite fast, about 1 min. However, when I use the same code with spark-1.5.1(with
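
The workflow being described is roughly the following sketch; the table name, filter column, and output path are made up:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="etl-json")
    sqlContext = HiveContext(sc)
    df = sqlContext.sql("SELECT * FROM source_table")   # create a dataframe from a Hive table
    out = df.filter(df.value > 0)                       # stand-in for the real operations
    out.toJSON().saveAsTextFile("hdfs:///tmp/etl_out")  # save as json, one object per line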

Re: Spark works with the data in another cluster (Elasticsearch)

2015-08-25 Thread gen tang
/performance_optimization/data_locality.html. Thanks Best Regards On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote: Hi, Currently, I have my data in an Elasticsearch cluster and I am trying to use Spark to analyse that data. The Elasticsearch cluster and the Spark cluster

Spark works with the data in another cluster (Elasticsearch)

2015-08-18 Thread gen tang
Hi, Currently, I have my data in an Elasticsearch cluster and I am trying to use Spark to analyse that data. The Elasticsearch cluster and the Spark cluster are two different clusters. And I use the Hadoop input format (es-hadoop) to read data from ES. I am wondering how this environment affects
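
For context, reading from ES with the es-hadoop input format looks roughly like this sketch; the node address and index name are assumptions:

    from pyspark import SparkContext

    sc = SparkContext(appName="es-read")
    conf = {"es.nodes": "es-host:9200",         # placeholder ES node
            "es.resource": "my_index/my_type"}  # placeholder index/type
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf)
    print(rdd.take(1))  # elements are (document id, document fields) pairs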

Re: Questions about SparkSQL join on non-equality conditions

2015-08-11 Thread gen tang
) that I use is created from a Hive table (about 1G). Therefore Spark thinks df1 is larger than df2, although df1 is very small. As a result, Spark tries to do df2.collect(), which causes the error. Hope this could be helpful. Cheers Gen On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com

Re: Questions about SparkSQL join on non-equality conditions

2015-08-10 Thread gen tang
is no way bigger than 1G. When I do a join on just one condition or an equality condition, there is no problem. Could anyone help me, please? Thanks a lot in advance. Cheers Gen On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote: Hi, I might have a stupid question about SparkSQL's

Questions about SparkSQL join on non-equality conditions

2015-08-09 Thread gen tang
Hi, I might have a stupid question about SparkSQL's implementation of joins on non-equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such joins, as it is very difficult to express such conditions as a map/reduce job. However, SparkSQL supports such
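
For illustration, a minimal pyspark sketch of such a join; the data and column names are made up:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="or-join")
    sqlContext = SQLContext(sc)
    df1 = sqlContext.createDataFrame([(1, 10), (2, 20)], ["id", "a"])
    df2 = sqlContext.createDataFrame([(1, 15), (3, 30)], ["id", "b"])
    # Join on a non-equality condition: condition1 OR condition2.
    joined = df1.join(df2, (df1.id == df2.id) | (df1.a < df2.b))
    joined.show()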

Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi, The Spark UI and logs don't show the state of the cluster. However, you can use Ganglia to monitor the cluster. In spark-ec2, there is an option to install Ganglia automatically. If you use CDH, you can also use Cloudera Manager. Cheers Gen On Sat, Aug 8, 2015 at 6:06 AM, Xiao

Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
Hi, In fact, PySpark uses org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/) to transform HBase Result objects into Python strings. Spark updated these two scripts recently. However, they are not included in the official release of Spark. So you
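
For context, the converters are used along these lines in hbase_inputformat.py; the ZooKeeper host and table name below are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="hbase-read")
    conf = {"hbase.zookeeper.quorum": "zk-host",       # placeholder host
            "hbase.mapreduce.inputtable": "my_table"}  # placeholder table
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=conf)
    print(rdd.take(1))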

Re: Spark MLib v/s SparkR

2015-08-07 Thread gen tang
Hi, It depends on the problem that you work on. Just as with Python and R, MLlib focuses on machine learning, and SparkR will focus on statistics, if SparkR follows the way of R. For instance, if you want to use glm to analyse data: 1. if you are interested only in the parameters of the model, and use this
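
As a rough illustration of the MLlib side of that comparison (the data is made up): fit a linear model, then either read back its parameters or use it for prediction:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    sc = SparkContext(appName="mllib-regression")
    data = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0]),
                           LabeledPoint(2.0, [2.0, 1.0]),
                           LabeledPoint(3.0, [3.0, 2.0])])
    model = LinearRegressionWithSGD.train(data, iterations=100)
    print(model.weights, model.intercept)  # case 1: the fitted parameters
    print(model.predict([4.0, 3.0]))       # case 2: use the model to predict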

Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
, it is not scheduler delay. When the computation finishes, the UI will show the correct scheduler delay time. Cheers Gen On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu dav...@databricks.com wrote: On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote: Hi, Recently, I met some problems with

large scheduler delay in pyspark

2015-08-03 Thread gen tang
Hi, Recently, I met some problems with scheduler delay in pyspark. I worked several days on this problem, but without success. Therefore, I come here to ask for help. I have a key-value pair rdd like rdd[(key, list[dict])] and I tried to merge values by adding the two lists. If I do reduceByKey as
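
A minimal sketch of the merge in question, with made-up data:

    from pyspark import SparkContext

    sc = SparkContext(appName="merge-lists")
    rdd = sc.parallelize([("k1", [{"a": 1}]), ("k1", [{"b": 2}]),
                          ("k2", [{"c": 3}])])
    # Merge values per key by concatenating the lists of dicts.
    merged = rdd.reduceByKey(lambda x, y: x + y)
    print(merged.collect())  # "k1" now maps to a single list with both dicts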

Strange behavior of pyspark with --jars option

2015-07-15 Thread gen tang
Hi, I met some interesting problems with the --jars option. As I use a third-party dependency, elasticsearch-spark, I pass this jar with the following command: ./bin/spark-submit --jars path-to-dependencies ... It works well. However, if I use HiveContext.sql, Spark will lose the dependencies
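
One workaround sometimes suggested for this kind of issue is to register the jar with the Hive session explicitly; this is an assumption on my part, not the resolution of the thread, and the jar path and table name are placeholders:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-with-jars")
    sqlContext = HiveContext(sc)
    # Assumption: push the dependency into the Hive session itself, in case
    # --jars alone does not reach the Hive classloader.
    sqlContext.sql("ADD JAR /path/to/elasticsearch-spark.jar")
    sqlContext.sql("SELECT * FROM my_es_backed_table").show()  # hypothetical table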

Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi, Maybe this will be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an HBase table in pyspark with a range
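
For a range scan specifically, one approach is to set TableInputFormat's scan-range keys in the job configuration, as in this sketch; the host, table, and row keys are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="hbase-range-scan")
    conf = {"hbase.zookeeper.quorum": "zk-host",
            "hbase.mapreduce.inputtable": "my_table",
            "hbase.mapreduce.scan.row.start": "row100",  # start of the key range
            "hbase.mapreduce.scan.row.stop": "row200"}   # end of the key range
    # Same newAPIHadoopRDD call as in the converter example above;
    # only the configuration changes.
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=conf)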

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi, If you do a cartesian join to predict users' preferences over all the products, I think that 8 nodes with 64GB RAM would not be enough for the data. Recently, I used ALS for a similar situation, with just 10M users and 0.1M products; the minimum requirement was 9 nodes with 10GB RAM. Moreover,
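
For scale, the pattern under discussion is roughly the following sketch (tiny made-up data; with 10M users and 0.1M products the cartesian pair RDD alone becomes enormous):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-cartesian")
    ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0),
                              Rating(2, 1, 4.0)])
    model = ALS.train(ratings, rank=10, iterations=5)
    users = sc.parallelize([1, 2])
    products = sc.parallelize([1, 2])
    # Score every (user, product) pair; this is the part that blows up at scale.
    predictions = model.predictAll(users.cartesian(products))
    print(predictions.collect())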

Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
Hi, There are some examples in spark/examples https://github.com/apache/spark/tree/master/examples and there are also some examples on spark-packages http://spark-packages.org/. And I find this blog http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html quite good. Hope it

Re: Spark on EC2

2015-02-24 Thread gen tang
24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote: Hi, As a real Spark cluster needs at least one master and one slave, you need to launch two machines. Therefore the second machine is not free. However, if you run Spark in local mode on one ec2 machine, it is free. The charge of AWS

Re: Spark on EC2

2015-02-24 Thread gen tang
Hi, As a real Spark cluster needs at least one master and one slave, you need to launch two machines. Therefore the second machine is not free. However, if you run Spark in local mode on one ec2 machine, it is free. The charge of AWS depends on how many machines you launch and the types of those machines,

Re: Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread gen tang
Hi, You can use -a or --ami <your ami id> to launch the cluster using a specific AMI. If I remember well, the default system is Amazon Linux. Hope it will help. Cheers Gen On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote: Hi there, Is there a way to specify the AWS AMI with

Re: Loading JSON dataset with Spark Mllib

2015-02-15 Thread gen tang
Hi, In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one JSON object per line as a SchemaRDD. Or you can use sc.textFile() to load the text file into an RDD and then use sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a SchemaRDD. Hope it could help
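
A minimal sketch of both routes; the file path is a placeholder (SchemaRDD is the pre-1.3 name of what later became DataFrame):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="json-load")
    sqlCtx = SQLContext(sc)
    # Route 1: load a file with one JSON object per line directly.
    people = sqlCtx.jsonFile("hdfs:///tmp/people.json")
    # Route 2: go through a plain text RDD first.
    lines = sc.textFile("hdfs:///tmp/people.json")
    people2 = sqlCtx.jsonRDD(lines)
    people.printSchema()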

Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
pandas) How do I create the image and the above library which would be used from pyspark. Thanks On Sun, Feb 8, 2015 at 3:03 AM, gen tang gen.tan...@gmail.com wrote: Hi, You can make an EC2 image with all the Python libraries installed and create a bash script to export PYTHONPATH

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread gen tang
Hi, You can make an EC2 image with all the Python libraries installed and create a bash script to export PYTHONPATH in the /etc/init.d/ directory. Then you can launch the cluster with this image and ec2.py. Hope this can be helpful. Cheers Gen On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, In fact, I met this problem before; it is a bug of AWS. Which type of machine do you use? If I guess well, you can check the file /etc/fstab: there may be a double mount of /dev/xvdb. If yes, you should 1. stop hdfs, 2. umount /dev/xvdb at /, 3. restart hdfs. Hope this could be helpful.

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, In fact, /dev/sdb is /dev/xvdb. It seems that there is no double-mount problem. However, there is no information about /mnt2. You should check whether /dev/sdc is mounted properly or not. Michael's reply is a good solution to this type of problem; you can check his site. Cheers Gen

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, I am sorry that I made a mistake. r3.large has only one SSD, which has been mounted at /mnt. Therefore there is no /dev/sdc. In fact, the problem is that there is no space under the / directory. So you should check whether your application writes data under this directory (for instance, save

Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi, In fact, this pull request https://github.com/apache/spark/pull/3920 is for doing HBase scans. However, it is not merged yet. You can also take a look at the example code at http://spark-packages.org/package/20, which uses Scala and Python to read data from HBase. Hope this can be helpful. Cheers Gen

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
advise on what version of Spark and which version of Hadoop you built Spark against, etc. Note that we only use the local file system and do not have any HDFS file system at all. I don't understand why Spark generates so many errors about Hadoop while we don't even need HDFS. Ningjun *From:* gen

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi, I tried to use Spark under Windows once. However, the only solution that I found was to install VirtualBox. Hope this can help you. Best Gen On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I deployed spark-1.1.0 on Windows 7 and was able to

[documentation] Update the python ALS example on the site?

2015-01-27 Thread gen tang
Hi, In Spark 1.2.0, the ratings are required to be an RDD of Rating, tuple, or list. However, the current example on the site still uses RDD[array] as the ratings. Therefore, the example doesn't work under version 1.2.0. Maybe we should update the documentation on the site? Thanks a
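
A short sketch of the input format that works in 1.2.0, with made-up ratings:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-ratings")
    # An RDD of Rating; plain (user, product, rating) tuples also work.
    ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0),
                              Rating(2, 2, 4.0)])
    model = ALS.train(ratings, rank=10, iterations=5)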

Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread gen tang
Hi, This is because ssh-ready in the ec2 script means that all the instances are in the 'running' state and all the instances have reached the 'OK' status. In other words, the instances are ready to download and install software, just as EMR is ready for bootstrap actions. Before, the script just

Re: Did anyone try overcommit of CPU cores?

2015-01-09 Thread gen tang
Hi, As you said, --executor-cores defines the max number of tasks that an executor can run simultaneously. So, if you claim 10 cores, it is not possible to launch more than 10 tasks in an executor at the same time. In my experience, setting more cores than there are physical CPU cores will
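
The same limit can also be set from code; a minimal sketch, using the 10 cores from the discussion:

    from pyspark import SparkConf, SparkContext

    # Equivalent of spark-submit --executor-cores 10: at most 10 tasks
    # run concurrently in each executor.
    conf = SparkConf().set("spark.executor.cores", "10")
    sc = SparkContext(conf=conf, appName="cores-demo")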

Re: Spark on Teradata?

2015-01-08 Thread gen tang
amount of data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote: Hi, I have a stupid question: Is it possible to use Spark on a Teradata data
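
Once that JDBC data source landed (Spark 1.3+), a pyspark read would look roughly like the sketch below; the Teradata JDBC URL, the driver being on the classpath, and the table name are all assumptions:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="jdbc-read")
    sqlContext = SQLContext(sc)
    # Assumption: the Teradata JDBC driver jar is available on the classpath;
    # the URL and table name are placeholders.
    df = sqlContext.load(source="jdbc",
                         url="jdbc:teradata://td-host/DATABASE=mydb",
                         dbtable="my_table")
    df.printSchema()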

Spark on Teradata?

2015-01-07 Thread gen tang
Hi, I have a stupid question: Is it possible to use Spark on a Teradata data warehouse, please? I read some news on the internet that says yes. However, I didn't find any example of this. Thanks in advance. Cheers Gen

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread gen tang
Hi, I am sorry to bother you, but I couldn't find any information about the online test for Spark certification managed through Kryterion. Could you please give me the link for it? Thanks a lot in advance. Cheers Gen On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote: Hi

Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
Hi, As the ec2 launch script provided by Spark uses https://github.com/mesos/spark-ec2 to download and configure all the tools in the cluster (Spark, Hadoop, etc.), you can create your own git repository to achieve your goal. More precisely: 1. Upload your own version of Spark to s3 at an address