Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread gen tang
Hi, You can use sqlContext.sql("use <database>") before using dataframe.saveAsTable. Hope it could be helpful. Cheers Gen On Sun, Feb 21, 2016 at 9:55 AM, Glen <cng...@gmail.com> wrote: > For dataframe in spark, so the table can be visited by hive. > > -- > Jacky Wang >
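A minimal sketch of the suggestion, assuming sqlContext is Hive-enabled and a database named mydb already exists (the names are illustrative):

```python
# Select the target database first; saveAsTable then writes into it.
sqlContext.sql("USE mydb")        # "mydb" is an illustrative database name
df.write.saveAsTable("my_table")  # Spark 1.4+ DataFrameWriter API
```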

Re: dataframe slows down with tungsten turned on

2015-11-04 Thread gen tang
Yes, the same code, the same result. In fact, the code has been running for more than a month. Before 1.5.0 the performance was much the same, so I suspect that it is caused by tungsten. Gen On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz <rah...@gmail.com> wrote: > Something to check (jus

dataframe slows down with tungsten turned on

2015-11-03 Thread gen tang
(with tungsten turned on), it takes about 2 hours to finish the same job. I checked the details of the tasks; almost all the time is consumed by computation. Any idea why this happens? Thanks a lot in advance for your help. Cheers Gen
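For anyone reproducing the comparison: in Spark 1.5 Tungsten could be toggled per session with a SQL conf flag, so one way to test the suspicion is to rerun the same job with it off. A sketch, assuming the Spark 1.5 flag name (it was removed in later releases):

```python
# Disable Tungsten for this session, rerun the job, and compare timings.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
```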

Re: Spark works with the data in another cluster (Elasticsearch)

2015-08-25 Thread gen tang
Gen On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath nick.pentre...@gmail.com wrote: While it's true locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run

Spark works with the data in another cluster (Elasticsearch)

2015-08-18 Thread gen tang
cluster. I would appreciate it if someone could share his/her experience of using Spark with Elasticsearch. Thanks a lot in advance for your help. Cheers Gen
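For reference, a hedged sketch of one common way at the time to read a remote Elasticsearch cluster from PySpark, via the elasticsearch-hadoop input format (the host, index, and type names are illustrative, and the es-hadoop jar must be on the classpath, e.g. via --jars):

```python
# Read an ES index as an RDD of (doc id, document) pairs.
es_conf = {
    "es.nodes": "es-cluster-host",    # address of the remote ES cluster (illustrative)
    "es.port": "9200",
    "es.resource": "myindex/mytype",  # index/type to read (illustrative)
}
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)
```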

Re: Questions about SparkSQL join on non-equality conditions

2015-08-11 Thread gen tang
) that I use is created from a hive table (about 1G). Therefore Spark thinks df1 is larger than df2, although df1 is very small. As a result, Spark tries to do df2.collect(), which causes the error. Hope this could be helpful. Cheers Gen On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com

Re: Questions about SparkSQL join on non-equality conditions

2015-08-10 Thread gen tang
is in no way bigger than 1G. When I do a join on just one condition or an equality condition, there is no problem. Could anyone help me, please? Thanks a lot in advance. Cheers Gen On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote: Hi, I might have a stupid question about sparksql's

Questions about SparkSQL join on non-equality conditions

2015-08-09 Thread gen tang
. So I would like to know how Spark implements it. As I observe that such a join runs very slowly, I guess that Spark implements it by doing a filter on top of a Cartesian product. Is that true? Thanks in advance for your help. Cheers Gen
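A minimal illustration of the join in question; with a non-equality predicate Spark cannot use a hash join, so the physical plan degenerates to a Cartesian-product/nested-loop strategy, which matches the observed slowness (the DataFrames and columns are illustrative):

```python
# Non-equality condition: no equi-join keys, so Spark falls back to a
# cartesian/nested-loop plan and filters the condition afterwards.
joined = df1.join(df2, df1.low <= df2.value)
joined.explain()  # inspect the physical plan to confirm the strategy
```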

Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi, The Spark UI and logs don't show the state of the cluster. However, you can use Ganglia to monitor the cluster. In spark-ec2 there is an option to install Ganglia automatically. If you use CDH, you can also use Cloudera Manager. Cheers Gen On Sat, Aug 8, 2015 at 6:06 AM, Xiao

Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
are trying to use this new Python script with an old jar. You can clone the latest Spark code from GitHub and build the examples jar; then you will get the correct result. Cheers Gen On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless eric.bl...@yahoo.com.invalid wrote: I’m having some difficulty getting

Re: Spark MLlib vs SparkR

2015-08-07 Thread gen tang
it can be helpful. Cheers Gen On Thu, Aug 6, 2015 at 2:24 AM, praveen S mylogi...@gmail.com wrote: I was wondering when one should go for MLlib or SparkR. What are the criteria, or what should be considered before choosing either of the solutions for data analysis? Or what are the advantages

Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
, it is not scheduler delay. When the computation finishes, the UI will show the correct scheduler delay time. Cheers Gen On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu dav...@databricks.com wrote: On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote: Hi, Recently, I met some problems about

large scheduler delay in pyspark

2015-08-03 Thread gen tang
on Yarn. But the first code works fine on the same data. Is there any way to find the logs when Spark stalls in scheduler delay, please? Or any ideas about this problem? Thanks a lot in advance for your help. Cheers Gen

Strange behavior of pyspark with --jars option

2015-07-15 Thread gen tang
this interesting problem happens? Thanks a lot for your help in advance. Cheers Gen

Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi, Maybe this might be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an hbase table in pyspark with a range scan
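A sketch of a range scan built on those converters, using HBase's TableInputFormat scan properties (the quorum, table name, and row keys are illustrative; the converter classes are the ones shipped in the Spark examples jar):

```python
conf = {
    "hbase.zookeeper.quorum": "zk-host",         # illustrative
    "hbase.mapreduce.inputtable": "my_table",    # illustrative
    "hbase.mapreduce.scan.row.start": "row100",  # start of the range (illustrative)
    "hbase.mapreduce.scan.row.stop": "row200",   # end of the range (illustrative)
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
```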

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
, even if the program passes, the processing time will be very long. Maybe you should try to reduce the set of products to predict for each client; in practice, you never need to predict the preference for all products to make a recommendation. Hope this will be helpful. Cheers Gen On Wed, Mar 18, 2015 at 12:13 PM
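A sketch of the suggested restriction: score only a candidate list per client instead of the full user × product cross product (the ids are illustrative; model is a trained MatrixFactorizationModel):

```python
# (user, product) candidate pairs -- a shortlist per client, not all products.
candidates = sc.parallelize([(1, 10), (1, 11), (2, 10)])
predictions = model.predictAll(candidates)  # returns an RDD of Rating objects
```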

Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
it would be helpful. Cheers Gen On Wed, Mar 4, 2015 at 6:51 PM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, How do I integrate HBase on Spark? Appreciate any replies! Regards, Sandeep.v

Re: Spark on EC2

2015-02-24 Thread gen tang
familiar with spark. You can do this on your laptop as well as on ec2. In fact, running ./ec2/spark-ec2 means launching spark standalone mode on a cluster; you can find more details here: https://spark.apache.org/docs/latest/spark-standalone.html Cheers Gen On Tue, Feb 24, 2015 at 4:07 PM, Deep

Re: Spark on EC2

2015-02-24 Thread gen tang
, but not on the utilization of the machines. Hope it helps. Cheers Gen On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on EC2

Re: Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread gen tang
Hi, You can use -a or --ami <your AMI id> to launch the cluster using a specific AMI. If I remember correctly, the default system is Amazon Linux. Hope it helps. Cheers Gen On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote: Hi there, Is there a way to specify the AWS AMI

Re: Loading JSON dataset with Spark Mllib

2015-02-15 Thread gen tang
Cheers Gen On Mon, Feb 16, 2015 at 12:39 AM, pankaj channe pankajc...@gmail.com wrote: Hi, I am new to spark and planning on writing a machine learning application with Spark mllib. My dataset is in json format. Is it possible to load data into spark without using any external json libraries? I
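A hedged sketch of doing this without any external JSON library, using Spark SQL's built-in JSON reader and mapping rows to MLlib's LabeledPoint (the field names label and features are illustrative and assume features is a JSON array of numbers):

```python
from pyspark.mllib.regression import LabeledPoint

df = sqlContext.jsonFile("data.json")  # Spark 1.x API; sqlContext.read.json in later versions
points = df.map(lambda row: LabeledPoint(row.label, row.features))
```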

Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
Hi, Please take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html Cheers Gen On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi I am very new both in spark and aws stuff.. Say, I want to install pandas on ec2.. (pip install

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread gen tang
Hi, You can make an image of EC2 with all the Python libraries installed and create a bash script in the /etc/init.d/ directory to export PYTHONPATH. Then you can launch the cluster with this image and ec2.py. Hope this can be helpful. Cheers Gen On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu

Re: no space left at worker node

2015-02-08 Thread gen tang
. Cheers Gen On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote: Hi, I submitted a spark job to an ec2 cluster, using spark-submit. At a worker node, there is an exception of 'no space left on device' as follows. == 15/02/08 01:53:38

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem with a double mount. However, there is no information about /mnt2. You should check whether /dev/sdc is well mounted or not. Michael's reply is a good solution to this type of problem. You can check his site. Cheers Gen

Re: no space left at worker node

2015-02-08 Thread gen tang
problem and find out the specific reason. Cheers Gen On Sun, Feb 8, 2015 at 10:45 PM, ey-chih chow eyc...@hotmail.com wrote: Thanks Gen. How can I check if /dev/sdc is well mounted or not? In general, the problem shows up when I submit the second or third job. The first job I submit most
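When a worker's volume fills up like this, one usual fix at the time was to point spark.local.dir at the large ephemeral mounts so shuffle and temp files land there, if it is not already set; a sketch (the paths follow the spark-ec2 layout and are illustrative):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.local.dir", "/mnt/spark,/mnt2/spark")  # comma-separated list
sc = SparkContext(conf=conf)
```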

Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi, In fact, this pull request https://github.com/apache/spark/pull/3920 implements HBase scan. However, it is not merged yet. You can also take a look at the example code at http://spark-packages.org/package/20 which uses Scala and Python to read data from HBase. Hope this can be helpful. Cheers Gen

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Gen On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Install virtual box which run Linux? That does not help us. We have business reason to run it on Windows operating system, e.g. Windows 2008 R2. If anybody have done that, please give some

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi, I tried to use Spark under Windows once. However, the only solution that I found was to install VirtualBox. Hope this can help you. Best Gen On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I deployed spark-1.1.0 on Windows 7 and was able

[documentation] Update the Python ALS example on the site?

2015-01-27 Thread gen tang
a lot. Cheers Gen

Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread gen tang
will finish the launch of cluster. Cheers Gen On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com wrote: Originally posted here: http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script I'm trying to launch a standalone Spark

Re: Did anyone try overcommitting CPU cores?

2015-01-09 Thread gen tang
Cheers Gen On Fri, Jan 9, 2015 at 10:12 AM, Xuelin Cao xuelincao2...@gmail.com wrote: Thanks, but, how to increase the tasks per core? For example, if the application claims 10 cores, is it possible to launch 100 tasks concurrently? On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke jornfra

Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply. In fact, I need to work on almost all the data in Teradata (~100T), so I don't think that JdbcRDD is a good choice. Cheers Gen On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote: Depending on your use cases. If the use case is to extract small
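For the smaller-extract case mentioned in the reply, a hedged sketch of reading Teradata through the JDBC data source added in later Spark releases (1.4+ API shown; the URL and table are illustrative, the Teradata JDBC driver jars must be on the classpath, and this still streams everything over JDBC, so it does not help at the ~100T scale):

```python
df = (sqlContext.read.format("jdbc")
      .options(url="jdbc:teradata://td-host/DATABASE=mydb",  # illustrative
               driver="com.teradata.jdbc.TeraDriver",
               dbtable="my_table")
      .load())
```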

Spark on teradata?

2015-01-07 Thread gen tang
Hi, I have a stupid question: is it possible to use Spark on a Teradata data warehouse? I read some articles on the internet that say yes; however, I didn't find any example of this. Thanks in advance. Cheers Gen

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread gen tang
Hi, I am sorry to bother you, but I couldn't find any information about the online Spark certification test managed through Kryterion. Could you please give me the link for it? Thanks a lot in advance. Cheers Gen On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote: Hi Saurabh

Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
path to your Spark. 2. Fork https://github.com/mesos/spark-ec2 and make a change in ./spark/init.sh (add a wget of the path to your Spark). 3. Change line 638 in the ec2 launch script to git clone your repository on GitHub. Hope this can be helpful. Cheers Gen On Tue, Jan 6, 2015 at 11:51 PM, Ganon Pierce ganon.pie

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-16 Thread Gen
Hi, How many clients and how many products do you have? Cheers Gen jaykatukuri wrote Hi all, I am running into an out of memory error while running ALS using MLLIB on a reasonably small data set consisting of around 6 million ratings. The stack trace is below: java.lang.OutOfMemoryError: Java heap

Re: Why so many tasks?

2014-12-16 Thread Gen
of partitions). Cheers Gen bethesda wrote Our job is creating what appears to be an inordinate number of very small tasks, which blow out our os inode and file limits. Rather than continually upping those limits, we are seeking to understand whether our real problem is that too many tasks
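The usual remedy for a flood of tiny tasks is to cut the partition count after loading; a minimal sketch:

```python
rdd = sc.textFile("s3n://bucket/path")  # illustrative input
# coalesce shrinks the partition count without a full shuffle;
# use repartition(n) instead if the data needs rebalancing.
rdd = rdd.coalesce(100)
print(rdd.getNumPartitions())
```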

Cannot pickle DecisionTreeModel in the pyspark

2014-12-12 Thread Gen
model is the only model in pyspark that we cannot pickle. FYI: I use Spark 1.1.1. Do you have any idea how to solve this problem? (I don't know whether using Scala can solve this problem or not.) Thanks a lot in advance for your help. Cheers Gen

Re: Does filter on an RDD scan every data item ?

2014-12-02 Thread Gen
Hi, For your first question, I think that we can use sc.parallelize(rdd.take(1000)). For your second question, I am not sure, but I don't think that we can restrict filter to certain partitions without scanning every element. Cheers Gen nsareen wrote Hi, I wanted some clarity

Re: --executor-cores cannot change vcores in yarn?

2014-11-03 Thread Gen
to monitor the CPU utilization during the Spark task. Spark can use all the CPUs even if I leave --executor-cores at its default (1). Hope that helps. Cheers Gen Gen wrote Hi, Maybe it is a stupid question, but I am running spark on yarn. I request the resources by the following command

--executor-cores cannot change vcores in yarn?

2014-11-01 Thread Gen
-status ID / to monitor the situation of the cluster. It shows that the number of vcores used for each container is always 1, no matter what number I pass via --executor-cores. Any ideas how to solve this problem? Thanks a lot in advance for your help. Cheers Gen

Re: Executor and BlockManager memory size

2014-10-31 Thread Gen
.compute.internal:38770 with 1294.1 MB RAM. So, according to the documentation, just 2156.83m is allocated to the executor. Moreover, according to YARN, 3072m of memory is used for this container. Do you have any ideas about this? Thanks a lot. Cheers Gen Boromir Widas wrote Hey Larry, I have been

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Gen
https://issues.apache.org/jira/browse/SPARK-2652?filter=-2 . Cheers. Gen sid wrote Hi, I am new to spark and I am trying to use pyspark. I am trying to find the mean of 128-dimension vectors present in a file. Below is the code
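Independently of the linked JIRA, a common workaround for pickle failures on custom classes in PySpark is to define the class in its own module and ship it with addPyFile, since classes defined in the main script cannot be re-imported on the executors; a sketch (module and class names are illustrative):

```python
# point.py -- defined in its own module, not in the main script
class Point(object):
    def __init__(self, vec):
        self.vec = vec
    def add(self, other):
        return Point([a + b for a, b in zip(self.vec, other.vec)])
```

```python
# main script
sc.addPyFile("point.py")  # ship the module to the executors
from point import Point

pairs = sc.parallelize([(1, Point([1.0, 2.0])), (1, Point([3.0, 4.0]))])
summed = pairs.reduceByKey(lambda a, b: a.add(b))
```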

Re: How to aggregate data in Apache Spark

2014-10-20 Thread Gen
Hi, I will write the code in python:

```python
from operator import add

# Please make sure that the rdd is like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
data = sc.textFile(...).map(...)
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
keypair = keypair.reduceByKey(add)
out = keypair.map(lambda l: list(l[0]) + [l[1]])  # e.g. flatten the key and append the sum
```

Re: ALS implicit error pyspark

2014-10-20 Thread Gen
to there for more information or make a contribution to fix this problem. Cheers Gen Gen wrote Hi, I am trying to use the ALS.trainImplicit method in pyspark.mllib.recommendation. However it didn't work. So I tried to use the example in the python API documentation such as: r1 = (1, 1, 1.0) r2 = (1, 2

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
, for example, ALS.trainImplicit(ratings, rank, 10), and it didn't work. After several tests, I found that only iterations = 1 works for pyspark. But for scala, all the values work. Best Gen Davies Liu-2 wrote On Thu, Oct 16, 2014 at 9:53 AM, Gen <gen.tang86@...> wrote: Hi, I am trying

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
at ALS.scala:314 . I will take a look at the log and try to find the problem. Best Gen Davies Liu-2 wrote I can run the following code against Spark 1.1 sc = SparkContext() r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model = ALS.trainImplicit(ratings

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, I created an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-3990 I uploaded the error information in JIRA. Thanks in advance for your help. Best Gen Davies Liu-2 wrote It seems a bug, Could you create a JIRA for it? thanks

ALS implicit error pyspark

2014-10-16 Thread Gen
Hi, I am trying to use the ALS.trainImplicit method in pyspark.mllib.recommendation. However it didn't work. So I tried to use the example in the python API documentation such as: r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model =
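For reference, the documentation example being tested, completed into a runnable form (as in the Python API docs of that era; the thread reports that in PySpark it only succeeded with iterations = 1):

```python
from pyspark.mllib.recommendation import ALS

r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1)  # rank 1, default iterations
model.predict(2, 2)
```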

Re: How to make operations like cogroup(), groupByKey() on pair RDD = [[ ], [ ], [ ]]

2014-10-16 Thread Gen
Hi, You just need to add list() in the sorted function. For example: map(lambda (x, y): (x, (list(y[0]), list(y[1]))), sorted(list(rdd1.cogroup(rdd2).collect()))) I think you just forgot the list... PS: your post has NOT been accepted by the mailing list yet. Best Gen pm wrote Hi

Re: ALS implicit error pyspark

2014-10-16 Thread Gen
TaskSet 975.0, whose tasks have all completed, from pool Gen wrote Hi, I am trying to use the ALS.trainImplicit method in pyspark.mllib.recommendation. However it didn't work. So I tried to use the example in the python API documentation such as: r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-15 Thread Gen
What results do you want? If your pair is like (a, b), where a is the key and b is the value, you can try rdd1 = rdd1.flatMap(lambda l: l) and then use cogroup. Best Gen

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, Are you sure that the id/key that you used can access s3? You can try to use the same id/key through the python boto package to test it, because I have almost the same situation as yours, but I can access s3. Best

Re: SparkSQL: select syntax

2014-10-14 Thread Gen
error. Cheers Gen Hao Ren wrote Update: This syntax is mainly for avoiding retyping column names. Let's take the example in my previous post, where a is a table of 15 columns, b has 5 columns, after a join, I have a table of (15 + 5 - 1 (key in b)) = 19 columns and register

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, If I remember correctly, Spark cannot use IAM role credentials to access s3. It first uses the id/key in the environment; if that is null, it uses the values in the file core-site.xml. So, an IAM role is not useful for Spark. The same problem happens if you want to use distcp
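A sketch of setting the keys explicitly when they are not in the environment, via the Hadoop configuration of the running SparkContext (these are the s3n Hadoop property names; sc._jsc is a private attribute but was the common route in PySpark, and the values are placeholders):

```python
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      # placeholder
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  # placeholder
rdd = sc.textFile("s3n://bucket/path")                           # illustrative
```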

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread TANG Gen
Hi, the same problem happens when I try several joins together, such as 'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)'. The error information is as follows:

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Gen
Hi, in fact, the same problem happens when I try several joins together: SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) py4j.protocol.Py4JJavaError: An error occurred while

The question about mounting ephemeral disks in slave-setup.sh

2014-10-03 Thread TANG Gen
that these disks are only mounted if the instance type begins with r3. For other instance types, are their ephemeral disks mounted or not? If yes, which script mounts them, or are they mounted automatically by AWS? Thanks a lot for your help in advance. Best regards Gen

Re: The question about mounting ephemeral disks in slave-setup.sh

2014-10-03 Thread TANG Gen
I have taken a look at the code of mesos spark-ec2 and the AWS documentation, and I think I may have found the answer. In fact, there are two types of AMI in AWS: EBS-backed AMIs and instance-store-backed AMIs. For an EBS-backed AMI, we can add instance store volumes when we create the image (the details can

Re: Spark Monitoring with Ganglia

2014-10-03 Thread TANG Gen
Maybe you can follow the instruction in this link https://github.com/mesos/spark-ec2/tree/v3/ganglia . For me it works well

Re: pyspark on python 3

2014-10-03 Thread Gen
According to the official Spark site, the latest version of Spark (1.1.0) does not work with Python 3: "Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used."

Re: partitions number with variable number of cores

2014-10-03 Thread Gen
Maybe I am wrong, but how many resources a Spark application can use depends on the mode of deployment (the type of resource manager); you can take a look at https://spark.apache.org/docs/latest/job-scheduling.html . For your case, I