Hi,
You can use
sqlContext.sql("use ")
before calling dataframe.saveAsTable.
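For example, a minimal pyspark sketch (the database and table names are placeholders):

    from pyspark.sql import HiveContext

    # assuming sc is an existing SparkContext and df an existing DataFrame
    sqlContext = HiveContext(sc)
    sqlContext.sql("use my_database")
    df.write.saveAsTable("my_table")   # df.saveAsTable("my_table") on older 1.x releases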
Hope it could be helpful
Cheers
Gen
On Sun, Feb 21, 2016 at 9:55 AM, Glen wrote:
> For dataframe in spark, so the table can be visited by hive.
>
> --
> Jacky Wang
>
> Are you getting identical results each time?
>
> On Wed, Nov 4, 2015 at 8:54 AM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi sparkers,
>>
>> I am using dataframe to do some large ETL jobs.
>> More precisely, I create dataframe from HI
Hi sparkers,
I am using dataframes to do some large ETL jobs.
More precisely, I create a dataframe from a Hive table, do some operations,
and then save the result as JSON.
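For reference, a minimal pyspark sketch of this kind of pipeline (the table name,
transformation, and output path are placeholders, not my actual job):

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)                        # sc is an existing SparkContext
    df = sqlContext.table("my_hive_table")              # placeholder Hive table name
    result = df.filter(df["some_column"].isNotNull())   # placeholder transformation
    result.write.json("/tmp/etl_output_json")           # Spark 1.4+ DataFrame writer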
When I used spark-1.4.1, the whole process was quite fast, about 1 minute.
However, when I use the same code with spark-1.5.1 (with
/performance_optimization/data_locality.html.
Thanks
Best Regards
On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
Currently, I have my data in the cluster of Elasticsearch and I try to
use spark to analyse those data.
The cluster of Elasticsearch and the cluster of spark
Hi,
Currently, I have my data in an Elasticsearch cluster and I am trying to use
Spark to analyse those data.
The Elasticsearch cluster and the Spark cluster are two different clusters,
and I use the Hadoop input format (es-hadoop) to read data from ES.
I am wondering how this environment affects
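For reference, roughly how that read looks in pyspark (a sketch, not my exact
code; the node address and index/type are placeholders, and the
elasticsearch-hadoop jar must be on the classpath, e.g. via --jars):

    es_conf = {
        "es.nodes": "es-cluster-host:9200",    # placeholder address of the ES cluster
        "es.resource": "my_index/my_type",     # placeholder index/type to read
    }
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)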
) that I use is created from a Hive table (about 1 GB). Therefore
Spark thinks df1 is larger than df2, although df1 is very small. As a
result, Spark tries to do df2.collect(), which causes the error.
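One possible workaround (my own hedged sketch, not from the original thread) is
to disable the automatic broadcast join so Spark stops collecting one side of
the join to the driver:

    # -1 disables the size threshold under which Spark SQL tries to broadcast a table
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
    joined = df1.join(df2, df1["key"] == df2["key"])   # df1, df2 and "key" are placeholders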
Hope this could be helpful
Cheers
Gen
On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com
is no way bigger than 1G.
When I do a join on just one condition or an equality condition, there is no
problem.
Could anyone help me, please?
Thanks a lot in advance.
Cheers
Gen
On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
I might have a stupid question about sparksql's
Hi,
I might have a stupid question about Spark SQL's implementation of joins on
non-equality conditions, for instance condition1 OR condition2.
In fact, Hive doesn't support such a join, as it is very difficult to express
such conditions as a map/reduce job. However, Spark SQL supports such
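For illustration, a minimal pyspark sketch of a join on an OR of two conditions
(the dataframes and column names are hypothetical):

    # a join whose predicate is an OR of two conditions
    joined = df1.join(df2, (df1["a"] == df2["a"]) | (df1["b"] < df2["b"]))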
Hi,
The Spark UI and logs don't show the overall state of the cluster. However, you
can use Ganglia to monitor the cluster. In spark-ec2, there is an
option to install Ganglia automatically.
If you use CDH, you can also use Cloudera Manager.
Cheers
Gen
On Sat, Aug 8, 2015 at 6:06 AM, Xiao
Hi,
In fact, PySpark uses
org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/)
to transform HBase Result objects into Python strings.
Spark updated these two scripts recently. However, they are not included in
the official release of Spark. So you
Hi,
It depends on the problem that you are working on.
Just as with Python and R, MLlib focuses on machine learning and SparkR will
focus on statistics, if SparkR follows the way of R.
For instance, if you want to use glm to analyse data:
1. if you are interested only in the parameters of the model, and use this
, it is not scheduler delay. When the computation
finishes, the UI will show the correct scheduler delay time.
Cheers
Gen
On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu dav...@databricks.com wrote:
On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
Recently, I met some problems about
Hi,
Recently, I met some problems with scheduler delay in PySpark. I worked on
this problem for several days, without success. Therefore, I have come here to
ask for help.
I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to
merge the values by adding (concatenating) the two lists.
If I do reduceByKey as
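For illustration, a sketch of the kind of merge I mean (not my exact code; the
lists are simply concatenated per key):

    # rdd has the form RDD[(key, list[dict])]; concatenating the two lists
    # merges all the dicts that share the same key
    merged = rdd.reduceByKey(lambda a, b: a + b)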
Hi,
I met some interesting problems with the --jars option.
As I use the third-party dependency elasticsearch-spark, I pass this jar
with the following command:
./bin/spark-submit --jars path-to-dependencies ...
It works well.
However, if I use HiveContext.sql, Spark loses the dependencies
Hi,
This might be helpful:
https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
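For reference, a sketch of how such converters are used from pyspark, roughly
following the hbase_inputformat.py example shipped with Spark (the ZooKeeper
quorum and table name are placeholders):

    hbase_conf = {
        "hbase.zookeeper.quorum": "zk-host",        # placeholder ZooKeeper quorum
        "hbase.mapreduce.inputtable": "my_table",   # placeholder HBase table name
    }
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=hbase_conf)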
Cheers
Gen
On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com
wrote:
I am attempting to read an hbase table in pyspark with a range
Hi,
If you do a cartesian join to predict users' preferences over all the
products, I think that 8 nodes with 64 GB of RAM would not be enough for the
data.
Recently, I used ALS for a similar situation with just 10M users and 0.1M
products, and the minimum requirement was 9 nodes with 10 GB of RAM.
Moreover,
Hi,
There are some examples in spark/examples
(https://github.com/apache/spark/tree/master/examples) and there are also
some examples in Spark packages (http://spark-packages.org/).
I also find this blog post
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
quite good.
Hope it
24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
As a real Spark cluster needs at least one master and one slave, you need
to launch two machines; therefore the second machine is not free.
However, if you run Spark in local mode on one EC2 machine, it is free.
The charge of AWS
Hi,
As a real Spark cluster needs at least one master and one slave, you need
to launch two machines; therefore the second machine is not free.
However, if you run Spark in local mode on one EC2 machine, it is free.
The charge for AWS depends on how many machines you launch and the types of
machines,
Hi,
You can use -a or --ami <your AMI id> to launch the cluster using a specific
AMI.
If I remember correctly, the default system is Amazon Linux.
Hope it will help
Cheers
Gen
On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote:
Hi there,
Is there a way to specify the AWS AMI with
Hi,
In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one
JSON object per line as a SchemaRDD.
Or you can use sc.textFile() to load the text file into an RDD and then use
sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a
SchemaRDD.
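A minimal sketch of both approaches (the path is a placeholder; sqlCtx is a
SQLContext and sc a SparkContext):

    # one JSON object per line in the input file
    df = sqlCtx.jsonFile("hdfs:///path/to/data.json")

    # or load the text first, then parse each string as JSON
    lines = sc.textFile("hdfs:///path/to/data.json")
    df = sqlCtx.jsonRDD(lines)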
Hope it could help
pandas)
How do I create the image and the above libraries so that they can be used
from pyspark?
Thanks
On Sun, Feb 8, 2015 at 3:03 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
You can make a image of ec2 with all the python libraries installed and
create a bash script to export python_path
Hi,
You can make an image of an EC2 machine with all the Python libraries installed
and create a bash script in the /etc/init.d/ directory that exports PYTHONPATH.
Then you can launch the cluster with this image and ec2.py.
Hope this can be helpful
Cheers
Gen
On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu
Hi,
In fact, I have met this problem before; it is a bug on the AWS side. Which type
of machine do you use?
If my guess is right, you can check the file /etc/fstab. There would be a double
mount of /dev/xvdb.
If yes, you should:
1. stop hdfs
2. umount /dev/xvdb at /
3. restart hdfs
Hope this could be helpful.
Hi,
In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem with a
double mount. However, there is no information about /mnt2. You should
check whether /dev/sdc is mounted correctly or not.
Michael's reply is a good solution to this type of problem. You can
check his site.
Cheers
Gen
Hi,
I am sorry that I made a mistake. r3.large has only one SSD, which is
mounted at /mnt. Therefore there is no /dev/sdc.
In fact, the problem is that there is no space left under the / directory. So
you should check whether your application writes data under this
directory (for instance, save
Hi,
In fact, this pull request https://github.com/apache/spark/pull/3920 is for
doing HBase scans. However, it is not merged yet.
You can also take a look at the example code at
http://spark-packages.org/package/20, which uses Scala and Python to
read data from HBase.
Hope this can be helpful.
Cheers
Gen
advise on what version of
Spark, and which version of Hadoop you built Spark against, etc. Note that
we only use the local file system and do not have any HDFS file system at all.
I don't understand why Spark generates so many Hadoop errors while we
don't even need HDFS.
Ningjun
From: gen
Hi,
I tried to use Spark under Windows once. However, the only solution that I
found was to install VirtualBox.
Hope this can help you.
Best
Gen
On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I deployed spark-1.1.0 on Windows 7 and was able to
Hi,
In Spark 1.2.0, the ratings are required to be an RDD of Rating, tuple, or
list. However, the current example on the site still uses
RDD[array] as the ratings. Therefore, the example doesn't work under
version 1.2.0.
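For illustration, a small pyspark sketch that matches the 1.2.0 API (the rating
values are made up):

    from pyspark.mllib.recommendation import ALS, Rating

    # each rating is (user, product, rating)
    ratings = sc.parallelize([Rating(1, 10, 5.0),
                              Rating(1, 20, 3.0),
                              Rating(2, 10, 4.0)])
    model = ALS.train(ratings, rank=10, iterations=10)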
Maybe we should update the documentation on the site?
Thanks a
Hi,
This is because 'ssh-ready' in the ec2 script means that all the instances
are in the 'running' state and all the instance status checks are OK.
In other words, the instances are ready to download and install
software, just as EMR is ready for bootstrap actions.
Before, the script just
Hi,
As you said, --executor-cores defines the maximum number of tasks that
an executor can run simultaneously. So, if you claim 10 cores, it is not
possible to launch more than 10 tasks in an executor at the same time.
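For example, a minimal pyspark sketch of claiming the cores through the
configuration, the equivalent of passing --executor-cores on the command line
(the application name is a placeholder):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("executor-cores-example")
            .set("spark.executor.cores", "10"))   # at most 10 concurrent tasks per executor
    sc = SparkContext(conf=conf)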
According to my experience, setting more cores than the number of physical CPU cores will
amount of
data out of Teradata, then you can use the JdbcRDD and, soon, a JDBC input
source based on the new Spark SQL external data source API.
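For illustration, a hedged sketch of the JDBC data source from pyspark, using
the later DataFrame reader API (the Teradata host, database, and table are
placeholders, and the Teradata JDBC driver jar is assumed to be on the
classpath):

    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:teradata://my-teradata-host/DATABASE=my_db")
          .option("dbtable", "my_table")
          .option("driver", "com.teradata.jdbc.TeraDriver")
          .load())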
On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
I have a stupid question:
Is it possible to use spark on Teradata data
Hi,
I have a stupid question:
Is it possible to use Spark on a Teradata data warehouse, please? I read some
news on the internet saying yes. However, I didn't find any example about
this issue.
Thanks in advance.
Cheers
Gen
Hi,
I am sorry to bother you, but I couldn't find any information about the online
test for the Spark certification managed through Kryterion.
Could you please give me the link for it?
Thanks a lot in advance.
Cheers
Gen
On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote:
Hi
Hi,
As the ec2 launch script provided by Spark uses
https://github.com/mesos/spark-ec2 to download and configure all the tools
in the cluster (Spark, Hadoop, etc.), you can create your own git repository
to achieve your goal. More precisely:
1. Upload your own version of Spark to S3 at the address