Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread gen tang
Hi,

You can use
sqlContext.sql("use <your database>")
before calling dataframe.saveAsTable
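
For example, a minimal PySpark sketch of the idea (assuming Spark 1.3+, with a
hypothetical database "my_db", source table and target table):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="save-as-table-example")
    sqlContext = HiveContext(sc)

    df = sqlContext.table("some_source_table")  # hypothetical source table

    # Switch the current database first, so the table is created inside my_db
    sqlContext.sql("use my_db")
    df.saveAsTable("my_table")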

Hope it could be helpful

Cheers
Gen


On Sun, Feb 21, 2016 at 9:55 AM, Glen  wrote:

> For a dataframe in Spark, so that the table can be accessed by Hive.
>
> --
> Jacky Wang
>


Re: dataframe slow down with tungsten turn on

2015-11-04 Thread gen tang
Yes, the same code, the same result.
In fact, the code has been running for more than a month. Before 1.5.0, the
performance was about the same, so I suspect the slowdown is caused by Tungsten.

Gen

On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz <rah...@gmail.com> wrote:

> Something to check (just in case):
> Are you getting identical results each time?
>
> On Wed, Nov 4, 2015 at 8:54 AM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi sparkers,
>>
>> I am using dataframes to do some large ETL jobs.
>> More precisely, I create a dataframe from a Hive table and do some
>> operations. Then I save it as JSON.
>>
>> When I used spark-1.4.1, the whole process was quite fast, about 1 minute.
>> However, when I use the same code with spark-1.5.1 (with Tungsten turned on),
>> it takes about 2 hours to finish the same job.
>>
>> I checked the detail of tasks, almost all the time is consumed by
>> computation.
>>
>> Any idea about why this happens?
>>
>> Thanks a lot in advance for your help.
>>
>> Cheers
>> Gen
>>
>>
>


dataframe slow down with tungsten turn on

2015-11-03 Thread gen tang
Hi sparkers,

I am using dataframes to do some large ETL jobs.
More precisely, I create a dataframe from a Hive table and do some operations.
Then I save it as JSON.
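
A minimal sketch of this kind of job, for reference (table, column and output
path names are hypothetical):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-to-json-etl")
    sqlContext = HiveContext(sc)

    # Read from a Hive table, apply some transformations, write out as JSON
    df = sqlContext.sql("select * from src_db.src_table")
    result = df.filter(df["status"] == "active").select("id", "payload")
    result.toJSON().saveAsTextFile("hdfs:///tmp/etl_output")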

When I used spark-1.4.1, the whole process was quite fast, about 1 minute.
However, when I use the same code with spark-1.5.1 (with Tungsten turned on),
it takes about 2 hours to finish the same job.

I checked the detail of tasks, almost all the time is consumed by
computation.

Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen


Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang
Great advice.
Thanks a lot Nick.

In fact, if we call rdd.persist(DISK) at the beginning of the program to
avoid hitting the network again and again, the speed is not affected much.
In my case, it is just 1 min more compared to the situation where we put
the data in local HDFS.
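
For reference, a rough PySpark sketch of this pattern (reading through
es-hadoop and persisting to disk once); the ES endpoint, index and class
names here are hypothetical and should be checked against the es-hadoop
documentation:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="es-analysis")

    es_conf = {
        "es.nodes": "es-cluster-host:9200",   # hypothetical ES endpoint
        "es.resource": "my_index/my_type",    # hypothetical index/type
    }
    rdd = sc.newAPIHadoopRDD(
        "org.elasticsearch.hadoop.mr.EsInputFormat",
        "org.apache.hadoop.io.NullWritable",
        "org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)

    # Persist to local disk once, so later stages do not hit the ES cluster
    rdd.persist(StorageLevel.DISK_ONLY)
    print(rdd.count())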

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 While it's true locality might speed things up, I'd say it's a very bad
 idea to mix your Spark and ES clusters - if your ES cluster is serving
 production queries (and in particular using aggregations), you'll run into
 performance issues on your production ES cluster.

 ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so
 pulling it across the network is not too bad. If you do need to avoid that,
 pull the data and write what you need to HDFS as, say, parquet files (e.g. pull
 data daily and write it, then you have all data available on your Spark
 cluster).

 And of course ensure that when you do pull data from ES to Spark, you
 cache it to avoid hitting the network again.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 If the data is local to the machine then obviously it will be faster
 compared to pulling it through the network and storing it locally (either
 memory or disk etc). Have a look at the data locality
 http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html
 .

 Thanks
 Best Regards

 On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 Currently, I have my data in the cluster of Elasticsearch and I try to
 use spark to analyse those data.
 The cluster of Elasticsearch and the cluster of spark are two different
 clusters. And I use hadoop input format(es-hadoop) to read data in ES.

 I am wondering how this environment affects the speed of the analysis.
 If I understand well, Spark will read data from the ES cluster and do the
 computation on its own cluster (including writing shuffle results on its own
 machines). Is this right? If so, I think the performance will be just a
 little bit slower than with the data stored on the same cluster.

 I would appreciate it if someone can share his/her experience about
 using Spark with Elasticsearch.

 Thanks a lot in advance for your help.

 Cheers
 Gen






Spark works with the data in another cluster(Elasticsearch)

2015-08-18 Thread gen tang
Hi,

Currently, I have my data in the cluster of Elasticsearch and I try to use
spark to analyse those data.
The cluster of Elasticsearch and the cluster of spark are two different
clusters. And I use hadoop input format(es-hadoop) to read data in ES.

I am wondering how this environment affects the speed of the analysis.
If I understand well, Spark will read data from the ES cluster and do the
computation on its own cluster (including writing shuffle results on its own
machines). Is this right? If so, I think the performance will be just a
little bit slower than with the data stored on the same cluster.

I would appreciate it if someone can share his/her experience about using
Spark with Elasticsearch.

Thanks a lot in advance for your help.

Cheers
Gen


Re: Questions about SparkSQL join on not equality conditions

2015-08-11 Thread gen tang
Hi,

After taking a look at the code, I found out the problem:
Spark uses BroadcastNestedLoopJoin to handle non-equality join conditions.
One of my dataframes (df1) is created from an existing RDD (LogicalRDD),
so Spark uses defaultSizeInBytes * length to estimate its size. The other
dataframe (df2) that I use is created from a Hive table (about 1G). Therefore
Spark thinks df1 is larger than df2, although df1 is very small. As a
result, Spark tries to do df2.collect(), which causes the error.
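
For reference, a small sketch of how to check which physical join Spark
chooses for such a condition (the tables and columns are hypothetical;
"some_hive_table" stands in for the larger Hive-backed dataframe):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="non-equi-join-plan")
    sqlContext = HiveContext(sc)

    # df1 created from an existing RDD, as in the situation described above
    df1 = sqlContext.createDataFrame(sc.parallelize([(1, 2)]), ["x", "y"])
    df1.registerTempTable("a")

    df2 = sqlContext.table("some_hive_table")  # hypothetical Hive table
    df2.registerTempTable("b")

    df = sqlContext.sql(
        "select a.x, b.x from a full outer join b on a.x = b.x or a.y = b.y")

    # The physical plan shows whether a BroadcastNestedLoopJoin is used and
    # which side Spark intends to broadcast
    df.explain()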

Hope this could be helpful

Cheers
Gen


On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I am sorry to bother again.
 When I do a join as follows:
 df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
 on condition1 *or* condition2")
 df.first()

 the program fails because the result size is bigger than
 spark.driver.maxResultSize.
 It is really strange, as no single record is anywhere near 1G.
 When I do the join on just one condition or on an equality condition, there
 is no problem.

 Could anyone help me, please?

 Thanks a lot in advance.

 Cheers
 Gen


 On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I might have a stupid question about SparkSQL's implementation of joins on
 non-equality conditions, for instance condition1 or condition2.

 In fact, Hive doesn't support such joins, as it is very difficult to
 express such conditions as a map/reduce job. However, SparkSQL supports
 this operation, so I would like to know how Spark implements it.

 As I observe that such joins run very slowly, I guess that Spark implements
 them by applying a filter on top of a Cartesian product. Is that true?

 Thanks in advance for your help.

 Cheers
 Gen






Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
Hi,

I am sorry to bother again.
When I do a join as follows:
df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
on condition1 *or* condition2")
df.first()

the program fails because the result size is bigger than
spark.driver.maxResultSize.
It is really strange, as no single record is anywhere near 1G.
When I do the join on just one condition or on an equality condition, there
is no problem.

Could anyone help me, please?

Thanks a lot in advance.

Cheers
Gen


On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I might have a stupid question about SparkSQL's implementation of joins on
 non-equality conditions, for instance condition1 or condition2.

 In fact, Hive doesn't support such joins, as it is very difficult to
 express such conditions as a map/reduce job. However, SparkSQL supports
 this operation, so I would like to know how Spark implements it.

 As I observe that such joins run very slowly, I guess that Spark implements
 them by applying a filter on top of a Cartesian product. Is that true?

 Thanks in advance for your help.

 Cheers
 Gen





Questions about SparkSQL join on not equality conditions

2015-08-09 Thread gen tang
Hi,

I might have a stupid question about SparkSQL's implementation of joins on
non-equality conditions, for instance condition1 or condition2.

In fact, Hive doesn't support such joins, as it is very difficult to express
such conditions as a map/reduce job. However, SparkSQL supports this
operation, so I would like to know how Spark implements it.

As I observe that such joins run very slowly, I guess that Spark implements
them by applying a filter on top of a Cartesian product. Is that true?

Thanks in advance for your help.

Cheers
Gen


Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi,

The Spark UI and logs don't show cluster-wide resource usage. However, you
can use Ganglia to monitor the state of the cluster. In spark-ec2, there is
an option to install Ganglia automatically.

If you use CDH, you can also use Cloudera Manager.

Cheers
Gen


On Sat, Aug 8, 2015 at 6:06 AM, Xiao JIANG jiangxia...@outlook.com wrote:

 Hi all,


 I was running some Hive/spark job on hadoop cluster.  I want to see how
 spark helps improve not only the elapsed time but also the total CPU
 consumption.


 For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when
 the job finishes. But I didn't find any CPU stats for Spark jobs from
 either spark log or web UI. Is there any place I can find the total CPU
 consumption for my spark job? Thanks!


 Here is the version info: Spark version 1.3.0 Using Scala version 2.10.4,
 Java 1.7.0_67


 Thanks!

 Xiao



Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
Hi,

In fact, PySpark uses
org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/pythonconverters/)
to transform HBase Result objects into Python strings.
Spark updated these two scripts recently. However, the changes are not
included in the official release of Spark yet, so you are trying to use the
new Python script with an old jar.

You can clone the newest code of Spark from GitHub and build the examples
jar. Then you will get the correct result.
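
For reference, the relevant call in hbase_inputformat.py looks roughly like
this (the ZooKeeper host is hypothetical; the converters are the ones from
the Spark examples, so the rebuilt examples jar has to be on the class path,
e.g. via --driver-class-path / --jars):

    from pyspark import SparkContext

    sc = SparkContext(appName="hbase-read")
    conf = {"hbase.zookeeper.quorum": "zk-host",
            "hbase.mapreduce.inputtable": "dev_wx_test"}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=conf)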

Cheers
Gen


On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless eric.bl...@yahoo.com.invalid
wrote:

 I’m having some difficulty getting the desired results from the Spark
 Python example hbase_inputformat.py. I’m running with CDH5.4, hbase Version
 1.0.0, Spark v 1.3.0 Using Python version 2.6.6

 I followed the example to create a test HBase table. Here’s the data from
 the table I created –
 hbase(main):001:0 scan 'dev_wx_test'
 ROW   COLUMN+CELL
 row1 column=f1:a, timestamp=1438716994027, value=value1
 row1 column=f1:b, timestamp=1438717004248, value=value2
 row2 column=f1:, timestamp=1438717014529, value=value3
 row3 column=f1:, timestamp=1438717022756, value=value4
 3 row(s) in 0.2620 seconds

 When either of these statements is included -
 “hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n"))”  or
 “hbase_rdd = hbase_rdd.flatMapValues(lambda v:
 v.split("\n")).countByValue().items()” the result is -
 We only get the following printed; (row1, value2) is not printed:
 ((u'row1', u'value1'), 1) ((u'row2', u'value3'), 1)
 ((u'row3', u'value4'), 1)
  This looks like similar results to the following post I found -

 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-td18613.html#a18650
 but it appears the pythonconverter HBaseResultToStringConverter has been
 updated since then.

And this problem will be resolved too.



 When the statement
 “hbase_rdd = hbase_rdd.flatMapValues(lambda v:
 v.split("\n")).mapValues(json.loads)” is included, the result is –
 ValueError: No JSON object could be decoded


 **
 Here is more info on this from the log –
 Traceback (most recent call last):
   File hbase_inputformat.py, line 87, in module
 output = hbase_rdd.collect()
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py,
 line 701, in collect
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/java_gateway.py,
 line 538, in __call__
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/protocol.py,
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o44.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 1.0 (TID 4, stluhdpddev27.monsanto.com):
 org.apache.spark.api.python.PythonException: Traceback (most recent call
 last):
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py,
 line 101, in main
 process()
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py,
 line 96, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/serializers.py,
 line 236, in dump_stream
 vs = list(itertools.islice(iterator, batch))
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py,
 line 1807, in lambda
   File /usr/lib64/python2.6/json/__init__.py, line 307, in loads
 return _default_decoder.decode(s)
   File /usr/lib64/python2.6/json/decoder.py, line 319, in decode
 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
   File /usr/lib64/python2.6/json/decoder.py, line 338, in raw_decode
 raise ValueError(No JSON object could be decoded)
 ValueError: No JSON object could be decoded

 at
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
 at
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:176)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 

Re: Spark MLib v/s SparkR

2015-08-07 Thread gen tang
Hi,

It depends on the problem that you work on.
Just as with Python and R, MLlib focuses on machine learning, and SparkR will
focus on statistics, if SparkR follows the way of R.

For instance, if you want to use glm to analyse data:
1. if you are interested only in the parameters of the model, and use the
model to predict, then you should use MLlib (see the sketch below)
2. if your focus is on the confidence of the model, for example the confidence
interval of the results and the significance level of the parameters, you
should choose SparkR. However, as there is no glm package for this purpose
yet, you need to code it yourself.
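
For illustration, a minimal MLlib sketch of point 1, fitting a model and
reading back its parameters (the input path and format are hypothetical):

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="mllib-glm-like")

    # Hypothetical input: "label,feature1,feature2" per line
    def parse(line):
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    data = sc.textFile("hdfs:///tmp/points.csv").map(parse)
    model = LogisticRegressionWithLBFGS.train(data)

    # MLlib gives the fitted parameters and predictions, but not confidence
    # intervals or significance tests (the R/SparkR side of the comparison)
    print(model.weights)
    print(model.intercept)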

Hope it can be helpful

Cheers
Gen


On Thu, Aug 6, 2015 at 2:24 AM, praveen S mylogi...@gmail.com wrote:

 I was wondering when one should go for MLlib or SparkR. What are the
 criteria, or what should be considered before choosing either of the
 solutions for data analysis?
 Or, what are the advantages of Spark MLlib over SparkR, or of SparkR
 over MLlib?



Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
Hi,
Thanks a lot for your reply.


It seems that it is because of the slowness of the second piece of code.
I rewrote the code as list(set([i.items for i in a] + [i.items for i in b]))
and the program now behaves normally.
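
For reference, a runnable variant of that idea; plain dicts are not hashable,
so this sketch converts them to sorted item tuples before deduplicating (an
adjustment for illustration, not the exact code from my job):

    from pyspark import SparkContext

    sc = SparkContext(appName="dedup-merge")
    rdd = sc.parallelize([("k", [{"a": 1}]), ("k", [{"a": 1}, {"b": 2}])])

    def dedup_merge(a, b):
        # Convert each dict to a hashable, order-independent form first
        seen = set(tuple(sorted(d.items())) for d in a + b)
        return [dict(t) for t in seen]

    print(rdd.reduceByKey(dedup_merge).collect())
    # [('k', [{'a': 1}, {'b': 2}])]  (order of the dicts may vary)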

By the way, I find that while the computation is running, the UI shows
scheduler delay; however, it is not really scheduler delay. When the
computation finishes, the UI shows the correct scheduler delay time.

Cheers
Gen


On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu dav...@databricks.com wrote:

 On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote:
  Hi,
 
  Recently, I met some problems with scheduler delay in pyspark. I worked
  on this problem for several days, but without success. Therefore, I come
  here to ask for help.
 
  I have a key_value pair rdd like rdd[(key, list[dict])] and I tried to
  merge the values by adding the two lists
 
   if I do reduceByKey as follows:
       rdd.reduceByKey(lambda a, b: a+b)
   It works fine, scheduler delay is less than 10s. However if I do
   reduceByKey:
       def f(a, b):
           for i in b:
               if i not in a:
                   a.append(i)
           return a
       rdd.reduceByKey(f)

 Is it possible that you have a large object that is also named `i` or `a` or
 `b`?

 Btw, the second one could be slower than the first one, because you look up
 an object in a list, which is O(N), especially when the object is large
 (a dict).

  It will cause a very large scheduler delay, about 15-20 mins. (The data I
  deal with is about 300 MB, and I use 5 machines with 32GB of memory.)

 If you see scheduler delay, it means there may be a large broadcast
 involved.

  I know the second piece of code is not the same as the first. In fact, my
  purpose is to implement the second, but it doesn't work, so I tried the
  first one.
  I don't know whether this is related to the data (with long strings) or to
  Spark on YARN. But the first code works fine on the same data.
 
  Is there any way to find the relevant logs when Spark stalls in scheduler
  delay, please? Or any ideas about this problem?
 
  Thanks a lot in advance for your help.
 
  Cheers
  Gen
 
 



large scheduler delay in pyspark

2015-08-03 Thread gen tang
Hi,

Recently, I met some problems with scheduler delay in pyspark. I worked on
this problem for several days, but without success. Therefore, I come here to
ask for help.

I have a key_value pair rdd like rdd[(key, list[dict])] and I tried to
merge the values by adding the two lists

if I do reduceByKey as follows:
    rdd.reduceByKey(lambda a, b: a+b)
It works fine, scheduler delay is less than 10s. However if I do
reduceByKey:
    def f(a, b):
        for i in b:
            if i not in a:
                a.append(i)
        return a
    rdd.reduceByKey(f)
It will cause a very large scheduler delay, about 15-20 mins. (The data I deal
with is about 300 MB, and I use 5 machines with 32GB of memory.)

I know the second piece of code is not the same as the first. In fact, my
purpose is to implement the second, but it doesn't work, so I tried the first
one.
I don't know whether this is related to the data (with long strings) or to
Spark on YARN. But the first code works fine on the same data.

Is there any way to find the relevant logs when Spark stalls in scheduler
delay, please? Or any ideas about this problem?

Thanks a lot in advance for your help.

Cheers
Gen


Strange behavoir of pyspark with --jars option

2015-07-15 Thread gen tang
Hi,

I met some interesting problems with the --jars option.
As I use a third-party dependency, elasticsearch-spark, I pass this jar
with the following command:
./bin/spark-submit --jars path-to-dependencies ...
It works well.
However, if I use HiveContext.sql, Spark loses the dependencies that I
passed. It seems that the execution of HiveContext overrides the
configuration. (But if we check sparkContext._conf, the configuration is
unchanged.)

But if I pass the dependencies with --driver-class-path
and spark.executor.extraClassPath, the problem disappears.
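
For reference, a sketch of the executor-side part of that workaround set
through SparkConf (the jar path is hypothetical); the driver-side class path
still has to be given on the command line with --driver-class-path, since the
driver JVM is already running when this Python code executes:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    es_jar = "/path/to/elasticsearch-spark.jar"  # hypothetical path

    conf = (SparkConf()
            .setAppName("es-with-hivecontext")
            .set("spark.executor.extraClassPath", es_jar))

    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)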

Does anyone know why this interesting problem happens?

Thanks a lot for your help in advance.

Cheers
Gen


Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi,

Maybe this might be helpful:
https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala

Cheers
Gen

On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com
 wrote:

 I am attempting to read an hbase table in pyspark with a range scan.

 conf = {
 "hbase.zookeeper.quorum": host,
 "hbase.mapreduce.inputtable": table,
 "hbase.mapreduce.scan": scan
 }
 hbase_rdd = sc.newAPIHadoopRDD(
 "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
 "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
 "org.apache.hadoop.hbase.client.Result",
 keyConverter=keyConv,
 valueConverter=valueConv,
 conf=conf)

 If I jump over to scala or java and generate a base64 encoded protobuf scan
 object and convert it to a string, I can use that value for
 "hbase.mapreduce.scan" and everything works: the rdd will correctly perform
 the range scan and I am happy.  The problem is that I can not find any
 reasonable way to generate that range scan string in python.   The scala
 code required is:

 import org.apache.hadoop.hbase.util.Base64;
 import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
 import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put,
 Result => HBaseResult, Scan}

 val scan = new Scan()
 scan.setStartRow("test_domain\0email".getBytes)
 scan.setStopRow("test_domain\0email~".getBytes)
 def scanToString(scan: Scan): String = { Base64.encodeBytes(
 ProtobufUtil.toScan(scan).toByteArray()) }
 scanToString(scan)


 Is there another way to perform an hbase range scan from pyspark or is that
 functionality something that might be supported in the future?




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi,

If you do a cartesian join to predict users' preferences over all the
products, I think that 8 nodes with 64GB of RAM would not be enough for the
data.
Recently, I used ALS in a similar situation with just 10M users and 0.1M
products, and the minimum requirement was 9 nodes with 10GB of RAM each.
Moreover, even if the program runs through, the processing time will be very
long.
Maybe you should try to reduce the set to predict for each client; in
practice, you never need to predict the preference for all products to make a
recommendation.
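
For illustration, a sketch of scoring each user only against a restricted
candidate set instead of the full cartesian join (the candidate selection
here is hypothetical; with MLlib's Python API the batch call is predictAll
on (user, product) pairs):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-restricted-predict")

    # Hypothetical implicit-feedback ratings
    ratings = sc.parallelize([Rating(1, 10, 1.0), Rating(1, 11, 2.0),
                              Rating(2, 10, 1.0)])
    model = ALS.trainImplicit(ratings, rank=10, iterations=10,
                              lambda_=0.0001, alpha=40.0)

    # Small per-user candidate set (e.g. popular or recently seen products)
    candidates = sc.parallelize([(1, 10), (1, 11), (2, 11)])
    print(model.predictAll(candidates).collect())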

Hope this will be helpful.

Cheers
Gen


On Wed, Mar 18, 2015 at 12:13 PM, Aram Mkrtchyan 
aram.mkrtchyan...@gmail.com wrote:

 Trying to build recommendation system using Spark MLLib's ALS.

 Currently, we're trying to pre-build recommendations for all users on
 daily basis. We're using simple implicit feedbacks and ALS.

 The problem is, we have 20M users and 30M products, and to call the main
 predict() method, we need to have the cartesian join for users and
 products, which is too huge, and it may take days to generate only the
 join. Is there a way to avoid cartesian join to make the process faster?

 Currently we have 8 nodes with 64Gb of RAM, I think it should be enough
 for the data.

 val users: RDD[Int] = ???   // RDD with 20M userIds
 val products: RDD[Int] = ???// RDD with 30M productIds
 val ratings : RDD[Rating] = ??? // RDD with all user-product feedbacks

 val model = new ALS().setRank(10).setIterations(10)
   .setLambda(0.0001).setImplicitPrefs(true)
   .setAlpha(40).run(ratings)

 val usersProducts = users.cartesian(products)
 val recommendations = model.predict(usersProducts)




Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
Hi,

There are some examples in spark/examples
https://github.com/apache/spark/tree/master/examples and there are also
some examples in Spark packages http://spark-packages.org/.
And I find this blog
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
quite good.

Hope it would be helpful

Cheers
Gen


On Wed, Mar 4, 2015 at 6:51 PM, sandeep vura sandeepv...@gmail.com wrote:

 Hi Sparkers,

 How do i integrate hbase on spark !!!

 Appreciate for replies !!

 Regards,
 Sandeep.v



Re: Spark on EC2

2015-02-24 Thread gen tang
Hi,

I am sorry, I made a mistake about the AWS pricing. You can read Sean Owen's
email, which explains the strategies for running Spark on AWS better.

For your question: it means that you just download Spark and unzip it, then
run the Spark shell with ./bin/spark-shell or ./bin/pyspark. It is useful for
getting familiar with Spark, and you can do this on your laptop as well as on
EC2. In fact, running ./ec2/spark-ec2 means launching Spark in standalone
mode on a cluster; you can find more details here:
https://spark.apache.org/docs/latest/spark-standalone.html

Cheers
Gen


On Tue, Feb 24, 2015 at 4:07 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Kindly bear with my questions as I am new to this.
  If you run spark on local mode on a ec2 machine
 What does this mean? Is it that I launch Spark cluster from my local
 machine,i.e., by running the shell script that is there in /spark/ec2?

 On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 As a real Spark cluster needs at least one master and one slave, you need
 to launch two machines. Therefore the second machine is not free.
 However, if you run Spark in local mode on a single EC2 machine, it is free.

 The AWS charge depends on how many machines you launch and their types,
 but not on the utilisation of the machines.

 Hope it would help.

 Cheers
 Gen


 On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You






Re: Spark on EC2

2015-02-24 Thread gen tang
Hi,

As a real Spark cluster needs at least one master and one slave, you need
to launch two machines. Therefore the second machine is not free.
However, if you run Spark in local mode on a single EC2 machine, it is free.

The AWS charge depends on how many machines you launch and their types,
but not on the utilisation of the machines.

Hope it would help.

Cheers
Gen


On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You



Re: Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread gen tang
Hi,

You can use -a or --ami <your AMI id> to launch the cluster using a specific
AMI.
If I remember correctly, the default system is Amazon Linux.

Hope it will help

Cheers
Gen


On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote:

 Hi there,

 Is there a way to specify the AWS AMI with particular OS (say Ubuntu) when
 launching Spark on Amazon cloud with provided scripts?

 What is the default AMI, operating system that is launched by EC-2 script?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-AMI-when-using-Spark-EC-2-scripts-tp21658.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Loading JSON dataset with Spark Mllib

2015-02-15 Thread gen tang
Hi,

In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one
JSON object per line as a SchemaRDD.
Or you can use sc.textFile() to load the text file into an RDD and then use
sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a
SchemaRDD.
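
For reference, a minimal sketch of both options (the path is hypothetical;
in Spark 1.3+ the result comes back as a DataFrame):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="load-json")
    sqlCtx = SQLContext(sc)

    # Option 1: one JSON object per line in a text file
    df1 = sqlCtx.jsonFile("hdfs:///tmp/data.json")

    # Option 2: load the text file as an RDD of strings first, then parse
    rdd = sc.textFile("hdfs:///tmp/data.json")
    df2 = sqlCtx.jsonRDD(rdd)

    df1.printSchema()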

Hope it could help
Cheers
Gen


On Mon, Feb 16, 2015 at 12:39 AM, pankaj channe pankajc...@gmail.com
wrote:

 Hi,

 I am new to spark and planning on writing a machine learning application
 with Spark mllib. My dataset is in json format. Is it possible to load data
 into spark without using any external json libraries? I have explored the
 option of SparkSql but I believe that is only for interactive use or
 loading data into hive tables.

 Thanks,
 Pankaj



Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
Hi,

Please take a look at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html

Cheers
Gen



On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi I am very new both in spark and aws stuff..
 Say, I want to install pandas on ec2.. (pip install pandas)
 How do I create the image and the above library which would be used from
 pyspark.
 Thanks

 On Sun, Feb 8, 2015 at 3:03 AM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 You can make an EC2 image (AMI) with all the python libraries installed and
 create a bash script to export PYTHONPATH in the /etc/init.d/ directory.
 Then you can launch the cluster with this image and ec2.py

 Hope this can be helpful

 Cheers
 Gen


 On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Hi,
   I want to install couple of python libraries (pip install
 python_library) which I want to use on pyspark cluster which are developed
 using the ec2 scripts.
 Is there a way to specify these libraries when I am building those ec2
 clusters?
 Whats the best way to install these libraries on each ec2 node?
 Thanks






Re: Installing a python library along with ec2 cluster

2015-02-08 Thread gen tang
Hi,

You can make an EC2 image (AMI) with all the python libraries installed and
create a bash script to export PYTHONPATH in the /etc/init.d/ directory.
Then you can launch the cluster with this image and ec2.py

Hope this can be helpful

Cheers
Gen


On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi,
   I want to install a couple of python libraries (pip install
 python_library) which I want to use on a pyspark cluster that is deployed
 using the ec2 scripts.
 Is there a way to specify these libraries when I am building those ec2
 clusters?
 What's the best way to install these libraries on each ec2 node?
 Thanks



Re: no space left at worker node

2015-02-08 Thread gen tang
Hi,

In fact, I met this problem before. It is a bug of AWS. Which type of
machine do you use?

If my guess is right, you can check the file /etc/fstab. There would be a
double mount of /dev/xvdb.
If yes, you should
1. stop hdfs
2. umount /dev/xvdb at /
3. restart hdfs

Hope this could be helpful.
Cheers
Gen



On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I submitted a spark job to an ec2 cluster, using spark-submit.  At a worker
 node, there is an exception of 'no space left on device' as follows.

 ==
 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file
 /root/spark/work/app-20150208014557-0003/0/stdout
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at

 org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
 at

 org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
 ===

 The command df showed the following information at the worker node:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256920   8256456 0 100% /
 tmpfs  7752012 0   7752012   0% /dev/shm
 /dev/xvdb 30963708   1729652  27661192   6% /mnt

 Does anybody know how to fix this?  Thanks.


 Ey-Chih Chow



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: no space left at worker node

2015-02-08 Thread gen tang
Hi,

In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem with a
double mount. However, there is no information about /mnt2. You should
check whether /dev/sdc is properly mounted or not.
The reply from Michael is a good solution for this type of problem. You can
check his site.

Cheers
Gen


On Sun, Feb 8, 2015 at 5:53 PM, ey-chih chow eyc...@hotmail.com wrote:

 Gen,

 Thanks for your information.  The content of /etc/fstab at the worker node
 (r3.large) is:

 #
 LABEL=/ /   ext4defaults,noatime  1   1
 tmpfs   /dev/shmtmpfs   defaults0   0
 devpts  /dev/ptsdevpts  gid=5,mode=620  0   0
 sysfs   /syssysfs   defaults0   0
 proc/proc   procdefaults0   0
 /dev/sdb/mntauto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0
 /dev/sdc/mnt2   auto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0

 There is no entry of /dev/xvdb.

  Ey-Chih Chow

 --
 Date: Sun, 8 Feb 2015 12:09:37 +0100
 Subject: Re: no space left at worker node
 From: gen.tan...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org


 Hi,

 In fact, I met this problem before. It is a bug of AWS. Which type of
 machine do you use?

 If my guess is right, you can check the file /etc/fstab. There would be a
 double mount of /dev/xvdb.
 If yes, you should
 1. stop hdfs
 2. umount /dev/xvdb at /
 3. restart hdfs

 Hope this could be helpful.
 Cheers
 Gen



 On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I submitted a spark job to an ec2 cluster, using spark-submit.  At a worker
 node, there is an exception of 'no space left on device' as follows.

 ==
 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file
 /root/spark/work/app-20150208014557-0003/0/stdout
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at

 org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
 at

 org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
 ===

 The command df showed the following information at the worker node:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256920   8256456 0 100% /
 tmpfs  7752012 0   7752012   0% /dev/shm
 /dev/xvdb 30963708   1729652  27661192   6% /mnt

 Does anybody know how to fix this?  Thanks.


 Ey-Chih Chow



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: no space left at worker node

2015-02-08 Thread gen tang
Hi,

I am sorry, I made a mistake. r3.large has only one SSD, which is mounted at
/mnt. Therefore there is no /dev/sdc.
In fact, the problem is that there is no space left under the / directory, so
you should check whether your application writes data under this
directory (for instance, saving files with file:///).

If not, you can use watch du -sh during the run to figure out
which directory is expanding. Normally, only the /mnt directory, which is
backed by the SSD, expands significantly, because the HDFS data is
saved there. Then you can find the directory which caused the no-space
problem and find out the specific reason.

Cheers
Gen



On Sun, Feb 8, 2015 at 10:45 PM, ey-chih chow eyc...@hotmail.com wrote:

 Thanks Gen.  How can I check if /dev/sdc is well mounted or not?  In
 general, the problem shows up when I submit the second or third job.  The
 first job I submit most likely will succeed.

 Ey-Chih Chow

 --
 Date: Sun, 8 Feb 2015 18:18:03 +0100

 Subject: Re: no space left at worker node
 From: gen.tan...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org

 Hi,

 In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem with a
 double mount. However, there is no information about /mnt2. You should
 check whether /dev/sdc is properly mounted or not.
 The reply from Michael is a good solution for this type of problem. You can
 check his site.

 Cheers
 Gen


 On Sun, Feb 8, 2015 at 5:53 PM, ey-chih chow eyc...@hotmail.com wrote:

 Gen,

 Thanks for your information.  The content of /etc/fstab at the worker node
 (r3.large) is:

 #
 LABEL=/ /   ext4defaults,noatime  1   1
 tmpfs   /dev/shmtmpfs   defaults0   0
 devpts  /dev/ptsdevpts  gid=5,mode=620  0   0
 sysfs   /syssysfs   defaults0   0
 proc/proc   procdefaults0   0
 /dev/sdb/mntauto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0
 /dev/sdc/mnt2   auto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0

 There is no entry of /dev/xvdb.

  Ey-Chih Chow

 --
 Date: Sun, 8 Feb 2015 12:09:37 +0100
 Subject: Re: no space left at worker node
 From: gen.tan...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org


 Hi,

 In fact, I met this problem before. It is a bug of AWS. Which type of
 machine do you use?

 If my guess is right, you can check the file /etc/fstab. There would be a
 double mount of /dev/xvdb.
 If yes, you should
 1. stop hdfs
 2. umount /dev/xvdb at /
 3. restart hdfs

 Hope this could be helpful.
 Cheers
 Gen



 On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I submitted a spark job to an ec2 cluster, using spark-submit.  At a worker
 node, there is an exception of 'no space left on device' as follows.

 ==
 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file
 /root/spark/work/app-20150208014557-0003/0/stdout
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at

 org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
 at

 org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
 ===

 The command df showed the following information at the worker node:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256920   8256456 0 100% /
 tmpfs  7752012 0   7752012   0% /dev/shm
 /dev/xvdb 30963708   1729652  27661192   6% /mnt

 Does anybody know how to fix this?  Thanks.


 Ey-Chih Chow



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi,

In fact, this pull request https://github.com/apache/spark/pull/3920 adds
support for HBase scans. However, it is not merged yet.
You can also take a look at the example code at
http://spark-packages.org/package/20, which uses Scala and Python to
read data from HBase.

Hope this can be helpful.

Cheers
Gen



On Thu, Feb 5, 2015 at 11:11 AM, Castberg, René Christian 
rene.castb...@dnvgl.com wrote:

  ​Hi,

 I am trying to do a hbase scan and read it into a spark rdd using pyspark.
 I have successfully written data to hbase from pyspark, and been able to
 read a full table from hbase using the python example code. Unfortunately I
 am unable to find any example code for doing an HBase scan and read it into
 a spark rdd from pyspark.

 I have found a scala example :

 http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark

 But I can't find anything on how to do this from python. Can anybody shed
 some light on how (and if) this can be done?

  Regards

  Rene Castberg​







Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi,

Using Spark under Windows is a really bad idea, because even if you solve the
problems with Hadoop, you will probably meet the problem of
java.net.SocketException: connection reset by peer. It is caused by the
fact that we request socket ports too frequently under Windows. To my
knowledge, it is really difficult to solve. And you will find something
really funny: the same code sometimes works and sometimes doesn't, even in
shell mode.

And I am sorry, but I don't see the point of running Spark under Windows,
moreover using the local file system, in a business environment. Do you have
a cluster on Windows?

FYI, I have used Spark prebuilt for Hadoop 1 under Windows 7 and there is no
problem launching it, but there are problems with java.net.SocketException.
If you are using Spark prebuilt for Hadoop 2, you should consider following
the solution provided by https://issues.apache.org/jira/browse/SPARK-2356

Cheers
Gen



On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  Install VirtualBox running Linux? That does not help us. We have a
 business reason to run it on a Windows operating system, e.g. Windows 2008 R2.



 If anybody has done that, please give some advice on what version of
 Spark to use, which version of Hadoop to build Spark against, etc. Note that
 we only use the local file system and do not have any HDFS file system at all.
 I don't understand why Spark generates so many errors about Hadoop while we
 don't even need HDFS.



 Ningjun





 *From:* gen tang [mailto:gen.tan...@gmail.com]
 *Sent:* Thursday, January 29, 2015 10:45 AM
 *To:* Wang, Ningjun (LNG-NPV)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Fail to launch spark-shell on windows 2008 R2



 Hi,



 I tried to use Spark under Windows once. However, the only solution that I
 found was to install VirtualBox.



 Hope this can help you.

 Best

 Gen





 On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:

 I deployed spark-1.1.0 on Windows 7 and was able to launch the
 spark-shell. I then deployed it to Windows 2008 R2, and when I launch the
 spark-shell, I get the error



 java.lang.RuntimeException: Error while running command to get file
 permissions : java.io.IOExceptio

 n: Cannot run program ls: CreateProcess error=2, The system cannot find
 the file specified

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)

 at org.apache.hadoop.util.Shell.run(Shell.java:182)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil

 eSystem.java:443)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst

 em.java:418)







 Here is the detail output





 C:\spark-1.1.0\bin   spark-shell

 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:14 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53692

 15/01/29 10:13:14 INFO Utils: Successfully started service 'HTTP class
 server' on port 53692.

 Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could
 not initialize class org.f

 usesource.jansi.internal.Kernel32

 Falling back to SimpleReader.

 Welcome to

     __

  / __/__  ___ _/ /__

 _\ \/ _ \/ _ `/ __/  '_/

/___/ .__/\_,_/_/ /_/\_\   version 1.1.0

   /_/



 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.7.0_71)

 Type in expressions to have them evaluated.

 Type :help for more information.

 15/01/29 10:13:18 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:18 INFO Slf4jLogger: Slf4jLogger started

 15/01/29 10:13:18 INFO Remoting: Starting remoting

 15/01/29 10:13:19 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@L

 AB4-WIN01.pcc.lexisnexis.com:53705]

 15/01/29 10:13:19 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi,

I tried to use Spark under Windows once. However, the only solution that I
found was to install VirtualBox.

Hope this can help you.
Best
Gen


On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  I deployed spark-1.1.0 on Windows 7 and was able to launch the
 spark-shell. I then deployed it to Windows 2008 R2, and when I launch the
 spark-shell, I get the error



 java.lang.RuntimeException: Error while running command to get file
 permissions : java.io.IOExceptio

 n: Cannot run program ls: CreateProcess error=2, The system cannot find
 the file specified

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)

 at org.apache.hadoop.util.Shell.run(Shell.java:182)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil

 eSystem.java:443)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst

 em.java:418)







 Here is the detail output





 C:\spark-1.1.0\bin   spark-shell

 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:14 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53692

 15/01/29 10:13:14 INFO Utils: Successfully started service 'HTTP class
 server' on port 53692.

 Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could
 not initialize class org.f

 usesource.jansi.internal.Kernel32

 Falling back to SimpleReader.

 Welcome to

     __

  / __/__  ___ _/ /__

 _\ \/ _ \/ _ `/ __/  '_/

/___/ .__/\_,_/_/ /_/\_\   version 1.1.0

   /_/



 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.7.0_71)

 Type in expressions to have them evaluated.

 Type :help for more information.

 15/01/29 10:13:18 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:18 INFO Slf4jLogger: Slf4jLogger started

 15/01/29 10:13:18 INFO Remoting: Starting remoting

 15/01/29 10:13:19 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@L

 AB4-WIN01.pcc.lexisnexis.com:53705]

 15/01/29 10:13:19 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkDriver@LAB4-WIN

 01.pcc.lexisnexis.com:53705]

 15/01/29 10:13:19 INFO Utils: Successfully started service 'sparkDriver'
 on port 53705.

 15/01/29 10:13:19 INFO SparkEnv: Registering MapOutputTracker

 15/01/29 10:13:19 INFO SparkEnv: Registering BlockManagerMaster

 15/01/29 10:13:19 INFO DiskBlockManager: Created local directory at
 C:\Users\NINGJU~1.WAN\AppData\Lo

 cal\Temp\3\spark-local-20150129101319-f9da

 15/01/29 10:13:19 INFO Utils: Successfully started service 'Connection
 manager for block manager' on

 port 53708.

 15/01/29 10:13:19 INFO ConnectionManager: Bound socket to port 53708 with
 id = ConnectionManagerId(L

 AB4-WIN01.pcc.lexisnexis.com,53708)

 15/01/29 10:13:19 INFO MemoryStore: MemoryStore started with capacity
 265.4 MB

 15/01/29 10:13:19 INFO BlockManagerMaster: Trying to register BlockManager

 15/01/29 10:13:19 INFO BlockManagerMasterActor: Registering block manager
 LAB4-WIN01.pcc.lexisnexis.

 com:53708 with 265.4 MB RAM

 15/01/29 10:13:19 INFO BlockManagerMaster: Registered BlockManager

 15/01/29 10:13:19 INFO HttpFileServer: HTTP File server directory is
 C:\Users\NINGJU~1.WAN\AppData\L

 ocal\Temp\3\spark-2f65b1c3-00e2-489b-967c-4e1f41520583

 15/01/29 10:13:19 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:19 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:19 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53709

 15/01/29 10:13:19 INFO Utils: Successfully started service 'HTTP file
 server' on port 53709.

 15/01/29 10:13:20 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:20 INFO AbstractConnector: Started
 SelectChannelConnector@0.0.0.0:4040

 15/01/29 10:13:20 INFO Utils: Successfully started service 'SparkUI' on
 

[documentation] Update the python example ALS of the site?

2015-01-27 Thread gen tang
Hi,

In Spark 1.2.0, ALS requires the ratings to be an RDD of Rating, tuple, or
list. However, the current example on the site still uses an RDD of arrays
as the ratings. Therefore, the example doesn't work under version 1.2.0.
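
For reference, the shape the Python example needs to take in 1.2.0 (the
input path and file format here are hypothetical):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-example")

    # Hypothetical input file with lines of the form "user,product,rating"
    data = sc.textFile("hdfs:///tmp/ratings.csv")
    ratings = data.map(lambda line: line.split(",")) \
                  .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

    model = ALS.train(ratings, rank=10, iterations=10)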

Maybe we should update the documentation on the site?
Thanks a lot.

Cheers
Gen


Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread gen tang
Hi,

This is because 'ssh-ready' in the ec2 script means that all the instances
are in the 'running' state and all their status checks are OK.
In other words, the instances are ready to download and install
software, just as EMR is ready for bootstrap actions.
Before, the script just repeatedly printed information showing that it was
waiting for every instance to be launched. That was quite ugly, so they
changed the message that is printed.
However, you can use ssh to connect to an instance even if it is in the
'pending' state. If you wait patiently a little longer, the script will
finish launching the cluster.

Cheers
Gen


On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com
wrote:

 Originally posted here:
 http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script

 I'm trying to launch a standalone Spark cluster using its pre-packaged EC2
 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

 ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k
 key-pair -i identity-file.pem -r us-west-2 -s 3 launch test
 Setting up security groups...
 Searching for existing cluster test...
 Spark AMI: ami-ae6e0d9e
 Launching instances...
 Launched 3 slaves in us-west-2c, regid = r-b___6
 Launched master in us-west-2c, regid = r-0__0
 Waiting for all instances in cluster to enter 'ssh-ready'
 state..

 Yet I can SSH into these instances without compaint:

 ubuntu@machine:~$ ssh -i identity-file.pem root@master-ip
 Last login: Day MMM DD HH:mm:ss 20YY from
 c-AA-BBB--DDD.eee1.ff.provider.net

__|  __|_  )
_|  ( /   Amazon Linux AMI
   ___|\___|___|

 https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
 There are 59 security update(s) out of 257 total update(s) available
 Run sudo yum update to apply all updates.
 Amazon Linux version 2014.09 is available.
 root@ip-internal ~]$

 I'm trying to figure out if this is a problem in AWS or with the Spark
 scripts. I've never had this issue before until recently.


 --
 Nathan Murthy // 713.884.7110 (mobile) // @natemurthy



Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread gen tang
Hi,

As you said, --executor-cores defines the maximum number of tasks that
an executor can run simultaneously. So, if you claim 10 cores, it is not
possible to launch more than 10 tasks in an executor at the same time.
In my experience, setting more cores than there are physical CPU cores will
overload the CPU at some point during the execution of the Spark application,
especially when you are using algorithms from the mllib package. In addition,
executor-cores affects the default level of parallelism of Spark.
Therefore, I recommend you set cores = physical cores by default.
Moreover, I don't think overcommitting CPU will increase CPU utilisation. In
my opinion, it just increases the CPU's waiting queue.
If you observe that the CPU load is very low (through Ganglia, for example)
and there is too much IO, then increasing the level of parallelism or
serializing your objects is probably a better choice.
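
For illustration, a minimal sketch of that alternative, raising the default
parallelism and switching to Kryo serialization (the values are arbitrary
examples, not recommendations):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("tuning-example")
            .set("spark.default.parallelism", "200")
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer"))

    sc = SparkContext(conf=conf)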

Hoping this helps

Cheers
Gen


On Fri, Jan 9, 2015 at 10:12 AM, Xuelin Cao xuelincao2...@gmail.com wrote:


 Thanks, but, how to increase the tasks per core?

 For example, if the application claims 10 cores, is it possible to launch
 100 tasks concurrently?



 On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke jornfra...@gmail.com wrote:

 Hallo,

 Based on experiences with other software in virtualized environments I
 cannot really recommend this. However, I am not sure how Spark reacts. You
 may face unpredictable task failures depending on utilization, tasks
 connecting to external systems (databases etc.) may fail unexpectedly and
 this might be a problem for them (transactions not finishing etc.).

 Why not increase the tasks per core?

 Best regards
 On Jan 9, 2015, at 06:46, Xuelin Cao xuelincao2...@gmail.com wrote:


 Hi,

   I'm wondering whether it is a good idea to overcommit CPU cores on
 the spark cluster.

   For example, in our testing cluster, each worker machine has 24
 physical CPU cores. However, we are allowed to set the CPU core number to
 48 or more in the spark configuration file. As a result, we are allowed to
 launch more tasks than the number of physical CPU cores.

   The motivation of overcommit CPU cores is, for many times, a task
 cannot consume 100% resource of a single CPU core (due to I/O, shuffle,
 etc.).

   So, overcommit the CPU cores allows more tasks running at the same
 time, and makes the resource be used economically.

   But, is there any reason that we should not doing like this?
 Anyone tried this?







Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply.
In fact, I need to work on almost all the data in Teradata (~100T), so I
don't think that JdbcRDD is a good choice.

Cheers
Gen


On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote:

 It depends on your use cases. If the use case is to extract a small amount of
 data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input
 source based on the new Spark SQL external data source API.
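
 For reference, a sketch of what that JDBC data source looks like in later
 releases (Spark 1.4+); the URL, table and driver class below are
 hypothetical examples and assume an existing SparkContext sc:

     from pyspark.sql import SQLContext

     sqlContext = SQLContext(sc)
     df = (sqlContext.read.format("jdbc")
           .options(url="jdbc:teradata://td-host/DATABASE=mydb",
                    dbtable="mydb.small_table",
                    driver="com.teradata.jdbc.TeraDriver")
           .load())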



 On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I have a stupid question:
 Is it possible to use Spark with a Teradata data warehouse, please? I read
 some news on the internet which says yes. However, I didn't find any
 examples about this.

 Thanks in advance.

 Cheers
 Gen





Spark on teradata?

2015-01-07 Thread gen tang
Hi,

I have a stupid question:
Is it possible to use Spark with a Teradata data warehouse, please? I read
some news on the internet which says yes. However, I didn't find any examples
about this.

Thanks in advance.

Cheers
Gen


Re: Spark Trainings/ Professional certifications

2015-01-07 Thread gen tang
Hi,

I am sorry to bother you, but I couldn't find any information about the
online test for the Spark certification managed through Kryterion.
Could you please give me the link for it?
Thanks a lot in advance.

Cheers
Gen


On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote:

 Hi Saurabh,

 In your area, Big Data Partnership provides Spark training:
 http://www.bigdatapartnership.com/

 As Sean mentioned, there is a certification program via a partnership
 between O'Reilly Media and Databricks http://www.oreilly.com/go/sparkcert
  That is offered in two ways, in-person at events such as Strata + Hadoop
 World http://strataconf.com/  and also an online test managed through
 Kryterion. Sign up on the O'Reilly page.

 There are also two MOOCs starting soon on edX through University of
 California:

 Intro to Big Data with Apache Spark
 by Prof. Anthony Joseph, UC Berkeley
 begins Feb 23

 https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x#.VK1pkmTF8gk

 Scalable Machine Learning
 Prof. Ameet Talwalkar, UCLA
 begins Apr 14

 https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#.VK1pkmTF8gk

 For coaching, arguably you might be best to talk with consultants
 especially for near-term needs. Contact me off-list and I can help provide
 intros in your area.

 Thanks,
 Paco


 On Wed, Jan 7, 2015 at 6:38 AM, Saurabh Agrawal 
 saurabh.agra...@markit.com wrote:


 Hi,



 Can you please suggest some of the best available trainings/ coaching
  and professional certifications in Apache Spark?



 We are trying to run predictive analysis on our Sales data and come out
 with recommendations (leads). We have tried to run CF but we end up getting
 absolutely bogus results!! A training that would leave us hands on to do
 our job effectively is what we are after. In addition to this, if this
 training could provide a firm ground for a professional certification, that
 would be an added advantage.



 Thanks for your inputs



 Regards,

 Saurabh Agrawal






Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
Hi,

The ec2 launch script provided by Spark uses
https://github.com/mesos/spark-ec2 to download and configure all the tools
in the cluster (Spark, Hadoop, etc.), so you can create your own git
repository to achieve your goal. More precisely:

1. Upload your own version of Spark to S3 at <path to your spark>
2. Fork https://github.com/mesos/spark-ec2 and make a change in
./spark/init.sh (add wget <path to your spark>)
3. Change line 638 in the ec2 launch script so that it git clones your own
repository from GitHub

Hope this can be helpful.

Cheers
Gen


On Tue, Jan 6, 2015 at 11:51 PM, Ganon Pierce ganon.pie...@me.com wrote:

 Is there a way to use the ec2 launch script with a locally built version
 of spark? I launch and destroy clusters pretty frequently and would like to
 not have to wait each time for the master instance to compile the source as
 happens when I set the -v tag with the latest git commit. To be clear, I
 would like to launch a non-release version of spark compiled locally as
 quickly as I can launch a release version (e.g. -v 1.2.0) which does not
 have to be compiled upon launch.

 Up to this point, I have just used the launch script included with the
 latest release to set up the cluster and then manually replaced the
 assembly file on the master and slaves with the version I built locally and
 then stored on s3. Is there anything wrong with doing it this way? Further,
 is there a better or more standard way of accomplishing this?
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org