Re: how to set database in DataFrame.saveAsTable?

2016-02-20 Thread gen tang
Hi,

You can use
sqlContext.sql("use ")
before use dataframe.saveAsTable
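
For example, a minimal PySpark sketch (the database and table names below are just placeholders; it assumes a Spark build with Hive support):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="save-as-table-demo")
sqlContext = HiveContext(sc)

# any DataFrame will do; this one is built from a tiny toy RDD
df = sqlContext.createDataFrame(sc.parallelize([(1, "a"), (2, "b")]), ["id", "value"])

sqlContext.sql("use my_database")   # switch the current database (must already exist)
df.saveAsTable("my_table")          # the table ends up as my_database.my_table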

Hope it could be helpful

Cheers
Gen


On Sun, Feb 21, 2016 at 9:55 AM, Glen <cng...@gmail.com> wrote:

> For a dataframe in Spark, so that the table can be accessed by Hive.
>
> --
> Jacky Wang
>


Re: dataframe slow down with tungsten turn on

2015-11-04 Thread gen tang
Yes, the same code, the same result.
In fact, the code has been running for more than a month. Before 1.5.0, the
performance was about the same, so I suspect that it is caused by tungsten.
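
If I remember correctly, in 1.5.x you can turn Tungsten off and rerun the same job to confirm the hypothesis, e.g. (using the sqlContext of the job):

sqlContext.setConf("spark.sql.tungsten.enabled", "false")
# then rerun the ETL and compare the run time with the default setting (true)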

Gen

On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz <rah...@gmail.com> wrote:

> Something to check (just in case):
> Are you getting identical results each time?
>
> On Wed, Nov 4, 2015 at 8:54 AM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi sparkers,
>>
>> I am using dataframes to do some large ETL jobs.
>> More precisely, I create a dataframe from a HIVE table, do some
>> operations, and then save it as json.
>>
>> When I used spark-1.4.1, the whole process was quite fast, about 1 minute.
>> However, when I use the same code with spark-1.5.1 (with tungsten turned on),
>> it takes about 2 hours to finish the same job.
>>
>> I checked the details of the tasks; almost all the time is consumed by
>> computation.
>>
>> Any idea about why this happens?
>>
>> Thanks a lot in advance for your help.
>>
>> Cheers
>> Gen
>>
>>
>


dataframe slow down with tungsten turn on

2015-11-03 Thread gen tang
Hi sparkers,

I am using dataframes to do some large ETL jobs.
More precisely, I create a dataframe from a HIVE table, do some operations,
and then save it as json.

When I used spark-1.4.1, the whole process was quite fast, about 1 minute.
However, when I use the same code with spark-1.5.1 (with tungsten turned on),
it takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all the time is consumed by
computation.

Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen


Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang
Great advice.
Thanks a lot Nick.

In fact, if we use the rdd.persist(DISK) command at the beginning of the
program to avoid hitting the network again and again, the speed is not
influenced much. In my case, it is just 1 min more compared to the
situation where we put the data in local HDFS.
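
For reference, here is roughly what the reading + persist part looks like in pyspark (a rough sketch: the class names are those of the es-hadoop connector, and the host and index names are placeholders; sc is the SparkContext from the shell):

from pyspark import StorageLevel

es_conf = {
    "es.nodes": "es-host:9200",        # placeholder ES node
    "es.resource": "my_index/my_type"  # placeholder index/type
}
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)
rdd.persist(StorageLevel.DISK_ONLY)  # pull the data over the network only once
print(rdd.count())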

Cheers
Gen

On Tue, Aug 25, 2015 at 6:26 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 While it's true locality might speed things up, I'd say it's a very bad
 idea to mix your Spark and ES clusters - if your ES cluster is serving
 production queries (and in particular using aggregations), you'll run into
 performance issues on your production ES cluster.

 ES-hadoop uses ES scan & scroll to pull data pretty efficiently, so
 pulling it across the network is not too bad. If you do need to avoid that,
 pull the data and write what you need to HDFS as, say, parquet files (e.g. pull
 data daily and write it, then you have all data available on your Spark
 cluster).

 And of course ensure that when you do pull data from ES to Spark, you
 cache it to avoid hitting the network again.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 If the data is local to the machine then obviously it will be faster
 compared to pulling it through the network and storing it locally (either
 memory or disk etc). Have a look at the data locality
 http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html
 .

 Thanks
 Best Regards

 On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 Currently, I have my data in an Elasticsearch cluster and I am trying to
 use spark to analyse that data.
 The Elasticsearch cluster and the spark cluster are two different
 clusters, and I use the hadoop input format (es-hadoop) to read data from ES.

 I am wondering how this environment affects the speed of the analysis.
 If I understand well, spark will read data from the ES cluster and do the
 calculation on its own cluster (including writing shuffle results on its own
 machines). Is this right? If this is correct, I think that the performance
 will be just a little bit slower than with the data stored on the same cluster.

 I would appreciate it if someone could share his/her experience about
 using spark with elasticsearch.

 Thanks a lot in advance for your help.

 Cheers
 Gen






Spark works with the data in another cluster(Elasticsearch)

2015-08-18 Thread gen tang
Hi,

Currently, I have my data in an Elasticsearch cluster and I am trying to use
spark to analyse that data.
The Elasticsearch cluster and the spark cluster are two different
clusters, and I use the hadoop input format (es-hadoop) to read data from ES.

I am wondering how this environment affects the speed of the analysis.
If I understand well, spark will read data from the ES cluster and do the
calculation on its own cluster (including writing shuffle results on its own
machines). Is this right? If this is correct, I think that the performance
will be just a little bit slower than with the data stored on the same cluster.

I would appreciate it if someone could share his/her experience about using
spark with elasticsearch.

Thanks a lot in advance for your help.

Cheers
Gen


Re: Questions about SparkSQL join on not equality conditions

2015-08-11 Thread gen tang
Hi,

After taking a look at the code, I found out the problem:
Spark uses BroadcastNestedLoopJoin to handle non-equality conditions.
One of my dataframes (df1) is created from an existing RDD (LogicalRDD),
so Spark uses defaultSizeInBytes * length to estimate its size. The other
dataframe (df2) that I use is created from a hive table (about 1G). Therefore
spark thinks df1 is larger than df2, although df1 is very small. As a
result, spark tries to do df2.collect(), which causes the error.
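
For reference, a small pyspark sketch of the kind of query involved (table and column names are made up; in my real case df2 came from a hive table; sc and sqlContext are those of the shell):

df1 = sqlContext.createDataFrame(sc.parallelize([(1, "a"), (2, "b")]), ["id", "x"])
df2 = sqlContext.createDataFrame(sc.parallelize([(1, "a"), (3, "c")]), ["id", "y"])
df1.registerTempTable("a")
df2.registerTempTable("b")

# an OR between predicates is a non-equality join condition,
# which Spark 1.x handles with BroadcastNestedLoopJoin; the side
# that Spark *estimates* to be smaller gets collected and broadcast
df = sqlContext.sql(
    "select a.x, b.y from a full outer join b on a.id = b.id or a.x = b.y")
df.first()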

Hope this could be helpful

Cheers
Gen


On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I am sorry to bother you again.
 When I do a join as follows:
 df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
 on condition1 *or* condition2")
 df.first()

 The program fails because the result size is bigger than
 spark.driver.maxResultSize.
 It is really strange, as one record is in no way bigger than 1G.
 When I do the join on just one condition or an equality condition, there is no
 problem.

 Could anyone help me, please?

 Thanks a lot in advance.

 Cheers
 Gen


 On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I might have a stupid question about sparksql's implementation of joins on
 non-equality conditions, for instance condition1 or condition2.

 In fact, Hive doesn't support such joins, as it is very difficult to
 express such conditions as a map/reduce job. However, sparksql supports
 such an operation. So I would like to know how spark implements it.

 As I observe that such a join runs very slowly, I guess that spark
 implements it by doing a filter on top of a cartesian product. Is that true?

 Thanks in advance for your help.

 Cheers
 Gen






Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
Hi,

I am sorry to bother you again.
When I do a join as follows:
df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
on condition1 *or* condition2")
df.first()

The program fails because the result size is bigger than
spark.driver.maxResultSize.
It is really strange, as one record is in no way bigger than 1G.
When I do the join on just one condition or an equality condition, there is no
problem.

Could anyone help me, please?

Thanks a lot in advance.

Cheers
Gen


On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I might have a stupid question about sparksql's implementation of joins on
 non-equality conditions, for instance condition1 or condition2.

 In fact, Hive doesn't support such joins, as it is very difficult to
 express such conditions as a map/reduce job. However, sparksql supports
 such an operation. So I would like to know how spark implements it.

 As I observe that such a join runs very slowly, I guess that spark
 implements it by doing a filter on top of a cartesian product. Is that true?

 Thanks in advance for your help.

 Cheers
 Gen





Questions about SparkSQL join on not equality conditions

2015-08-09 Thread gen tang
Hi,

I might have a stupid question about sparksql's implementation of joins on
non-equality conditions, for instance condition1 or condition2.

In fact, Hive doesn't support such joins, as it is very difficult to express
such conditions as a map/reduce job. However, sparksql supports such an
operation. So I would like to know how spark implements it.

As I observe that such a join runs very slowly, I guess that spark implements
it by doing a filter on top of a cartesian product. Is that true?

Thanks in advance for your help.

Cheers
Gen


Re: How to get total CPU consumption for Spark job

2015-08-07 Thread gen tang
Hi,

The Spark UI and logs don't provide cluster-level resource usage. However, you
can use Ganglia to monitor the cluster. In spark-ec2, there is an
option to install Ganglia automatically.

If you use CDH, you can also use Cloudera manager.

Cheers
Gen


On Sat, Aug 8, 2015 at 6:06 AM, Xiao JIANG jiangxia...@outlook.com wrote:

 Hi all,


 I was running some Hive/spark jobs on a hadoop cluster.  I want to see how
 spark helps improve not only the elapsed time but also the total CPU
 consumption.


 For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when
 the job finishes. But I didn't find any CPU stats for Spark jobs from
 either spark log or web UI. Is there any place I can find the total CPU
 consumption for my spark job? Thanks!


 Here is the version info: Spark version 1.3.0 Using Scala version 2.10.4,
 Java 1.7.0_67


 Thanks!

 Xiao



Re: Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread gen tang
Hi,

In fact, Pyspark uses
org.apache.spark.examples.pythonconverters (./examples/src/main/scala/org/apache/spark/examples/pythonconverters/)
to transform HBase Result objects into python strings.
Spark updated these converters recently. However, the updates are not included
in the official release of spark, so you are trying to use the new python
script with an old jar.

You can clone the newest code of spark from github and build the examples jar.
Then you will get the correct result.
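
For reference, this is roughly what the updated example does once the rebuilt examples jar is on the classpath (a sketch: the zookeeper host is a placeholder, the converter class names are those in the Spark examples source tree, and sc is the SparkContext from the shell):

import json

conf = {"hbase.zookeeper.quorum": "zk-host",
        "hbase.mapreduce.inputtable": "dev_wx_test"}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
# the updated converter emits one JSON string per cell, separated by newlines
output = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)
print(output.collect())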

Cheers
Gen


On Sat, Aug 8, 2015 at 5:03 AM, Eric Bless eric.bl...@yahoo.com.invalid
wrote:

 I’m having some difficulty getting the desired results from the Spark
 Python example hbase_inputformat.py. I’m running with CDH5.4, hbase Version
 1.0.0, Spark v 1.3.0 Using Python version 2.6.6

 I followed the example to create a test HBase table. Here’s the data from
 the table I created –
 hbase(main):001:0> scan 'dev_wx_test'
 ROW   COLUMN+CELL
 row1 column=f1:a, timestamp=1438716994027, value=value1
 row1 column=f1:b, timestamp=1438717004248, value=value2
 row2 column=f1:, timestamp=1438717014529, value=value3
 row3 column=f1:, timestamp=1438717022756, value=value4
 3 row(s) in 0.2620 seconds

 When either of these statements is included -
 “hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n"))”  or
 “hbase_rdd = hbase_rdd.flatMapValues(lambda v:
 v.split("\n")).countByValue().items()” the result is -
 We only get the following printed; (row1, value2) is not printed:
 ((u'row1', u'value1'), 1) ((u'row2', u'value3'), 1)
 ((u'row3', u'value4'), 1)
  This looks like similar results to the following post I found -

 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-td18613.html#a18650
 but it appears the pythonconverter HBaseResultToStringConverter has been
 updated since then.

And this problem will be resolved too.



 When the statement
 “hbase_rdd = hbase_rdd.flatMapValues(lambda v:
 v.split("\n")).mapValues(json.loads)” is included, the result is –
 ValueError: No JSON object could be decoded


 **
 Here is more info on this from the log –
 Traceback (most recent call last):
   File hbase_inputformat.py, line 87, in module
 output = hbase_rdd.collect()
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py,
 line 701, in collect
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/java_gateway.py,
 line 538, in __call__
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/py4j/protocol.py,
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o44.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 1.0 (TID 4, stluhdpddev27.monsanto.com):
 org.apache.spark.api.python.PythonException: Traceback (most recent call
 last):
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py,
 line 101, in main
 process()
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/worker.py,
 line 96, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/serializers.py,
 line 236, in dump_stream
 vs = list(itertools.islice(iterator, batch))
   File
 /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/rdd.py,
 line 1807, in lambda
   File /usr/lib64/python2.6/json/__init__.py, line 307, in loads
 return _default_decoder.decode(s)
   File /usr/lib64/python2.6/json/decoder.py, line 319, in decode
 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
   File /usr/lib64/python2.6/json/decoder.py, line 338, in raw_decode
 raise ValueError(No JSON object could be decoded)
 ValueError: No JSON object could be decoded

 at
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
 at
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:176)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor

Re: Spark MLib v/s SparkR

2015-08-07 Thread gen tang
Hi,

It depends on the problem that you work on.
Just as with python and R, MLlib focuses on machine learning, and SparkR will
focus on statistics, if SparkR follows the way of R.

For instance, if you want to use glm to analyse data:
1. if you are interested only in the parameters of the model, and in using this
model to predict, then you should use MLlib (see the small sketch below);
2. if your focus is on the confidence of the model, for example the confidence
interval of the result and the significance level of the parameters, you should
choose SparkR. However, as there is no glm package for this purpose yet, you
would need to code it yourself.
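
To illustrate the first case, a minimal MLlib sketch in pyspark (toy data, Spark 1.x API; sc is the SparkContext from the shell):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

# toy training set: a label plus two features per point
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(1.0, [2.0, 1.0]),
])
model = LogisticRegressionWithLBFGS.train(data)
print(model.weights)                          # the fitted parameters...
print(model.predict(Vectors.dense([1.5, 0.5])))  # ...and a prediction, but no confidence intervals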

Hope it can be helpful

Cheers
Gen


On Thu, Aug 6, 2015 at 2:24 AM, praveen S mylogi...@gmail.com wrote:

 I was wondering when one should go for MLib or SparkR. What is the
 criteria or what should be considered before choosing either of the
 solutions for data analysis?
 or What is the advantages of Spark MLib over Spark R or advantages of
 SparkR over MLib?



Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
Hi,
Thanks a lot for your reply.


It seems that it is because of the slowness of the second piece of code.
I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])).
The program returns to normal.
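
For the record, a sketch of the set-based merge I ended up with (assuming every value inside the dicts is hashable; rdd is the (key, list[dict]) pair RDD from the original mail):

def merge(a, b):
    # deduplicate dicts by turning each one into a frozenset of its items,
    # so the membership test is a hash lookup instead of an O(N) list scan
    seen = set(frozenset(d.items()) for d in a + b)
    return [dict(fs) for fs in seen]

rdd.reduceByKey(merge)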

By the way, I find that while the computation is running, the UI shows
scheduler delay. However, it is not really scheduler delay. When the computation
finishes, the UI shows the correct scheduler delay time.

Cheers
Gen


On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu dav...@databricks.com wrote:

 On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote:
  Hi,
 
  Recently, I met some problems about scheduler delay in pyspark. I worked
  several days on this problem, but not success. Therefore, I come to here
 to
  ask for help.
 
  I have a key_value pair rdd like rdd[(key, list[dict])] and I tried to
 merge
  value by adding two list
 
  if I do reduceByKey as follows:
     rdd.reduceByKey(lambda a, b: a+b)
  It works fine, scheduler delay is less than 10s. However if I do
  reduceByKey:
     def f(a, b):
         for i in b:
             if i not in a:
                 a.append(i)
         return a
     rdd.reduceByKey(f)

 Is it possible that you have large object that is also named `i` or `a` or
 `b`?

 Btw, the second one could be slower than the first one, because you try to look up
 an object in a list, which is O(N), especially when the object is large
 (a dict).

  It will cause very large scheduler delay, about 15-20 mins.(The data I
 deal
  with is about 300 mb, and I use 5 machine with 32GB memory)

 If you see scheduler delay, it means there may be a large broadcast
 involved.

  I know the second code is not the same as the first. In fact, my purpose
 is
  to implement the second, but not work. So I try the first one.
  I don't know whether this is related to the data(with long string) or
 Spark
  on Yarn. But the first code works fine on the same data.
 
  Is there any way to find out the log when spark stall in scheduler delay,
  please? Or any ideas about this problem?
 
  Thanks a lot in advance for your help.
 
  Cheers
  Gen
 
 



large scheduler delay in pyspark

2015-08-03 Thread gen tang
Hi,

Recently, I met some problems with scheduler delay in pyspark. I worked
several days on this problem, but without success. Therefore, I come here to
ask for help.

I have a key-value pair rdd like rdd[(key, list[dict])] and I tried to
merge the values by concatenating the two lists.

if I do reduceByKey as follows:
    rdd.reduceByKey(lambda a, b: a+b)
It works fine, scheduler delay is less than 10s. However if I do
reduceByKey:
    def f(a, b):
        for i in b:
            if i not in a:
                a.append(i)
        return a
    rdd.reduceByKey(f)
It causes a very large scheduler delay, about 15-20 mins. (The data I deal
with is about 300 MB, and I use 5 machines with 32GB memory.)

I know the second piece of code is not the same as the first. In fact, my purpose
is to implement the second, but it does not work, so I tried the first one.
I don't know whether this is related to the data (with long strings) or to Spark
on Yarn. But the first code works fine on the same data.

Is there any way to find the logs when spark stalls in scheduler delay,
please? Or any ideas about this problem?

Thanks a lot in advance for your help.

Cheers
Gen


Strange behavior of pyspark with --jars option

2015-07-15 Thread gen tang
Hi,

I met some interesting problems with the --jars option.
As I use a third-party dependency, elasticsearch-spark, I pass this jar
with the following command:
./bin/spark-submit --jars path-to-dependencies ...
It works well.
However, if I use HiveContext.sql, spark loses the dependencies that I
passed. It seems that the execution of HiveContext overrides the
configuration. (But if we check sparkContext._conf, the configuration is
unchanged.)

But if I pass the dependencies with --driver-class-path
and spark.executor.extraClassPath, the problem disappears.

Does anyone know why this interesting problem happens?

Thanks a lot for your help in advance.

Cheers
Gen


Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi,

Maybe this might be helpful:
https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala

Cheers
Gen

On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com
 wrote:

 I am attempting to read an hbase table in pyspark with a range scan.

 conf = {
     "hbase.zookeeper.quorum": host,
     "hbase.mapreduce.inputtable": table,
     "hbase.mapreduce.scan": scan
 }
 hbase_rdd = sc.newAPIHadoopRDD(
     "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
     "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
     "org.apache.hadoop.hbase.client.Result",
     keyConverter=keyConv,
     valueConverter=valueConv,
     conf=conf)

 If i jump over to scala or java and generate a base64 encoded protobuf scan
 object and convert it to a string, i can use that value for
 hbase.mapreduce.scan and everything works,  the rdd will correctly
 perform
 the range scan and I am happy.  The problem is that I can not find any
 reasonable way to generate that range scan string in python.   The scala
 code required is:

 import org.apache.hadoop.hbase.util.Base64;
 import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
 import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put,
 Result => HBaseResult, Scan}

 val scan = new Scan()
 scan.setStartRow("test_domain\0email".getBytes)
 scan.setStopRow("test_domain\0email~".getBytes)
 def scanToString(scan: Scan): String = { Base64.encodeBytes(
 ProtobufUtil.toScan(scan).toByteArray()) }
 scanToString(scan)


 Is there another way to perform an hbase range scan from pyspark or is that
 functionality something that might be supported in the future?




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Apache Spark ALS recommendations approach

2015-03-18 Thread gen tang
Hi,

If you do a cartesian join to predict users' preferences over all the
products, I think that 8 nodes with 64GB of RAM would not be enough for the
data.
Recently, I used ALS for a similar situation, with just 10M users and 0.1M
products, and the minimum requirement was 9 nodes with 10GB of RAM.
Moreover, even if the program passes, the processing time will be very long.
Maybe you should try to reduce the set of products to predict for each client,
as in practice you never need to predict the preference for all products to
make a recommendation.
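
For example, in pyspark it could look something like this (the Scala API is analogous; candidate_products_for is a made-up helper that returns a short candidate list per user, and users/model are those of your snippet below):

# score only a short candidate list per user instead of the full cartesian product
candidates = users.flatMap(
    lambda u: [(u, p) for p in candidate_products_for(u)])  # hypothetical helper
predictions = model.predictAll(candidates)                  # RDD[Rating]
top_n = (predictions
         .map(lambda r: (r.user, (r.product, r.rating)))
         .groupByKey()
         .mapValues(lambda prefs: sorted(prefs, key=lambda x: -x[1])[:10]))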

Hope this will be helpful.

Cheers
Gen


On Wed, Mar 18, 2015 at 12:13 PM, Aram Mkrtchyan 
aram.mkrtchyan...@gmail.com wrote:

 Trying to build recommendation system using Spark MLLib's ALS.

 Currently, we're trying to pre-build recommendations for all users on
 daily basis. We're using simple implicit feedbacks and ALS.

 The problem is, we have 20M users and 30M products, and to call the main
 predict() method, we need to have the cartesian join for users and
 products, which is too huge, and it may take days to generate only the
 join. Is there a way to avoid cartesian join to make the process faster?

 Currently we have 8 nodes with 64Gb of RAM, I think it should be enough
 for the data.

 val users: RDD[Int] = ???   // RDD with 20M userIds
 val products: RDD[Int] = ???// RDD with 30M productIds
 val ratings : RDD[Rating] = ??? // RDD with all user-product feedbacks

 val model = new ALS().setRank(10).setIterations(10)
   .setLambda(0.0001).setImplicitPrefs(true)
   .setAlpha(40).run(ratings)

 val usersProducts = users.cartesian(products)
 val recommendations = model.predict(usersProducts)




Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
Hi,

There are some examples in spark/examples
https://github.com/apache/spark/tree/master/examples and there are also
some examples in spark packages http://spark-packages.org/.
And I find this blog
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
quite good.

Hope it would be helpful

Cheers
Gen


On Wed, Mar 4, 2015 at 6:51 PM, sandeep vura sandeepv...@gmail.com wrote:

 Hi Sparkers,

 How do i integrate hbase on spark !!!

 Appreciate for replies !!

 Regards,
 Sandeep.v



Re: Spark on EC2

2015-02-24 Thread gen tang
Hi,

I am sorry that I made a mistake about AWS pricing. You can read the email from
Sean Owen, which explains better the strategies for running spark on AWS.

For your question: it means that you just download spark and unzip it, then
run the spark shell with ./bin/spark-shell or ./bin/pyspark. It is useful to get
familiar with spark. You can do this on your laptop as well as on ec2. In
fact, running ./ec2/spark-ec2 means launching spark standalone mode on a
cluster; you can find more details here:
https://spark.apache.org/docs/latest/spark-standalone.html

Cheers
Gen


On Tue, Feb 24, 2015 at 4:07 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Kindly bear with my questions as I am new to this.
  If you run spark on local mode on a ec2 machine
 What does this mean? Is it that I launch Spark cluster from my local
 machine,i.e., by running the shell script that is there in /spark/ec2?

 On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 As a real spark cluster needs at least one master and one slave, you need
 to launch two machines. Therefore the second machine is not free.
 However, if you run spark in local mode on an ec2 machine, it is free.

 AWS charges depend on how many machines you launch and their types,
 but not on the utilisation of the machines.

 Hope it would help.

 Cheers
 Gen


 On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You






Re: Spark on EC2

2015-02-24 Thread gen tang
Hi,

As a real spark cluster needs at least one master and one slave, you need
to launch two machines. Therefore the second machine is not free.
However, if you run spark in local mode on an ec2 machine, it is free.

AWS charges depend on how many machines you launch and their types,
but not on the utilisation of the machines.

Hope it would help.

Cheers
Gen


On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You



Re: Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread gen tang
Hi,

You can use -a or --ami <your ami id> to launch the cluster using a specific
AMI.
If I remember well, the default system is Amazon Linux.

Hope it will help

Cheers
Gen


On Sun, Feb 15, 2015 at 6:20 AM, olegshirokikh o...@solver.com wrote:

 Hi there,

 Is there a way to specify the AWS AMI with particular OS (say Ubuntu) when
 launching Spark on Amazon cloud with provided scripts?

 What is the default AMI, operating system that is launched by EC-2 script?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-AMI-when-using-Spark-EC-2-scripts-tp21658.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Loading JSON dataset with Spark Mllib

2015-02-15 Thread gen tang
Hi,

In fact, you can use sqlCtx.jsonFile(), which loads a text file storing one
JSON object per line as a SchemaRDD.
Or you can use sc.textFile() to load the text file into an RDD and then use
sqlCtx.jsonRDD(), which loads an RDD storing one JSON object per string as a
SchemaRDD.
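
For example (Spark 1.2 API; the path is a placeholder and sc is the SparkContext from the shell):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

# option 1: a text file with one JSON object per line
people = sqlCtx.jsonFile("path/to/people.json")

# option 2: an RDD of JSON strings
rdd = sc.textFile("path/to/people.json")
people2 = sqlCtx.jsonRDD(rdd)

people.printSchema()
people.registerTempTable("people")   # then query it with sqlCtx.sql(...)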

Hope it could help
Cheers
Gen


On Mon, Feb 16, 2015 at 12:39 AM, pankaj channe pankajc...@gmail.com
wrote:

 Hi,

 I am new to spark and planning on writing a machine learning application
 with Spark mllib. My dataset is in json format. Is it possible to load data
 into spark without using any external json libraries? I have explored the
 option of SparkSql but I believe that is only for interactive use or
 loading data into hive tables.

 Thanks,
 Pankaj



Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
Hi,

Please take a look at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html

Cheers
Gen



On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi I am very new both in spark and aws stuff..
 Say, I want to install pandas on ec2.. (pip install pandas)
 How do I create the image and the above library which would be used from
 pyspark.
 Thanks

 On Sun, Feb 8, 2015 at 3:03 AM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 You can make an ec2 image (AMI) with all the python libraries installed and
 create a bash script that exports PYTHONPATH in the /etc/init.d/ directory.
 Then you can launch the cluster with this image and ec2.py.

 Hope this can be helpful

 Cheers
 Gen


 On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Hi,
   I want to install couple of python libraries (pip install
 python_library) which I want to use on pyspark cluster which are developed
 using the ec2 scripts.
 Is there a way to specify these libraries when I am building those ec2
 clusters?
 Whats the best way to install these libraries on each ec2 node?
 Thanks






Re: Installing a python library along with ec2 cluster

2015-02-08 Thread gen tang
Hi,

You can make an ec2 image (AMI) with all the python libraries installed and
create a bash script that exports PYTHONPATH in the /etc/init.d/ directory.
Then you can launch the cluster with this image and ec2.py.

Hope this can be helpful

Cheers
Gen


On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi,
   I want to install couple of python libraries (pip install
 python_library) which I want to use on pyspark cluster which are developed
 using the ec2 scripts.
 Is there a way to specify these libraries when I am building those ec2
 clusters?
 Whats the best way to install these libraries on each ec2 node?
 Thanks



Re: no space left at worker node

2015-02-08 Thread gen tang
Hi,

In fact, I met this problem before; it is a bug of AWS. Which type of
machine do you use?

If I guess well, you can check the file /etc/fstab. There would be a double
mount of /dev/xvdb.
If yes, you should
1. stop hdfs
2. umount /dev/xvdb at /
3. restart hdfs

Hope this could be helpful.
Cheers
Gen



On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I submitted a spark job to an ec2 cluster, using spark-submit.  At a worker
 node, there is an exception of 'no space left on device' as follows.

 ==
 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file
 /root/spark/work/app-20150208014557-0003/0/stdout
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at

 org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
 at

 org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
 ===

 The command df showed the following information at the worker node:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256920   8256456 0 100% /
 tmpfs  7752012 0   7752012   0% /dev/shm
 /dev/xvdb 30963708   1729652  27661192   6% /mnt

 Does anybody know how to fix this?  Thanks.


 Ey-Chih Chow



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: no space left at worker node

2015-02-08 Thread gen tang
Hi,

In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem with a
double mount. However, there is no information about /mnt2. You should
check whether /dev/sdc is well mounted or not.
The reply from Michael is a good solution for this type of problem. You can
check his site.

Cheers
Gen


On Sun, Feb 8, 2015 at 5:53 PM, ey-chih chow eyc...@hotmail.com wrote:

 Gen,

 Thanks for your information.  The content of /etc/fstab at the worker node
 (r3.large) is:

 #
 LABEL=/ /   ext4defaults,noatime  1   1
 tmpfs   /dev/shmtmpfs   defaults0   0
 devpts  /dev/ptsdevpts  gid=5,mode=620  0   0
 sysfs   /syssysfs   defaults0   0
 proc/proc   procdefaults0   0
 /dev/sdb/mntauto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0
 /dev/sdc/mnt2   auto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0

 There is no entry of /dev/xvdb.

  Ey-Chih Chow

 --
 Date: Sun, 8 Feb 2015 12:09:37 +0100
 Subject: Re: no space left at worker node
 From: gen.tan...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org


 Hi,

 I fact, I met this problem before. it is a bug of AWS. Which type of
 machine do you use?

 If I guess well, you can check the file /etc/fstab. There would be a
 double mount of /dev/xvdb.
 If yes, you should
 1. stop hdfs
 2. umount /dev/xvdb at /
 3. restart hdfs

 Hope this could be helpful.
 Cheers
 Gen



 On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I submitted a spark job to an ec2 cluster, using spark-submit.  At a worker
 node, there is an exception of 'no space left on device' as follows.

 ==
 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file
 /root/spark/work/app-20150208014557-0003/0/stdout
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at

 org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
 at

 org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
 ===

 The command df showed the following information at the worker node:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256920   8256456 0 100% /
 tmpfs  7752012 0   7752012   0% /dev/shm
 /dev/xvdb 30963708   1729652  27661192   6% /mnt

 Does anybody know how to fix this?  Thanks.


 Ey-Chih Chow



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: no space left at worker node

2015-02-08 Thread gen tang
Hi,

I am sorry that I made a mistake: r3.large has only one SSD, which has been
mounted at /mnt. Therefore there is no /dev/sdc.
In fact, the problem is that there is no space under the / directory. So
you should check whether your application writes data under this
directory (for instance, saving files with file:///).

If not, you can use watch du -sh during the run time to figure out
which directory is expanding. Normally, only the /mnt directory, which is
backed by the SSD, expands significantly, because the hdfs data is
saved there. Then you can find the directory which caused the no-space problem
and find out the specific reason.

Cheers
Gen



On Sun, Feb 8, 2015 at 10:45 PM, ey-chih chow eyc...@hotmail.com wrote:

 Thanks Gen.  How can I check if /dev/sdc is well mounted or not?  In
 general, the problem shows up when I submit the second or third job.  The
 first job I submit most likely will succeed.

 Ey-Chih Chow

 --
 Date: Sun, 8 Feb 2015 18:18:03 +0100

 Subject: Re: no space left at worker node
 From: gen.tan...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org

 Hi,

 In fact, /dev/sdb is /dev/xvdb. It seems that there is no problem about
 double mount. However, there is no information about /mnt2. You should
 check whether /dev/sdc is well mounted or not.
 The reply of Micheal is good solution about this type of problem. You can
 check his site.

 Cheers
 Gen


 On Sun, Feb 8, 2015 at 5:53 PM, ey-chih chow eyc...@hotmail.com wrote:

 Gen,

 Thanks for your information.  The content of /etc/fstab at the worker node
 (r3.large) is:

 #
 LABEL=/ /   ext4defaults,noatime  1   1
 tmpfs   /dev/shmtmpfs   defaults0   0
 devpts  /dev/ptsdevpts  gid=5,mode=620  0   0
 sysfs   /syssysfs   defaults0   0
 proc/proc   procdefaults0   0
 /dev/sdb/mntauto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0
 /dev/sdc/mnt2   auto
  defaults,noatime,nodiratime,comment=cloudconfig 0   0

 There is no entry of /dev/xvdb.

  Ey-Chih Chow

 --
 Date: Sun, 8 Feb 2015 12:09:37 +0100
 Subject: Re: no space left at worker node
 From: gen.tan...@gmail.com
 To: eyc...@hotmail.com
 CC: user@spark.apache.org


 Hi,

 I fact, I met this problem before. it is a bug of AWS. Which type of
 machine do you use?

 If I guess well, you can check the file /etc/fstab. There would be a
 double mount of /dev/xvdb.
 If yes, you should
 1. stop hdfs
 2. umount /dev/xvdb at /
 3. restart hdfs

 Hope this could be helpful.
 Cheers
 Gen



 On Sun, Feb 8, 2015 at 8:16 AM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,

 I submitted a spark job to an ec2 cluster, using spark-submit.  At a worker
 node, there is an exception of 'no space left on device' as follows.

 ==
 15/02/08 01:53:38 ERROR logging.FileAppender: Error writing stream to file
 /root/spark/work/app-20150208014557-0003/0/stdout
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:345)
 at

 org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
 at

 org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:72)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at

 org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
 ===

 The command df showed the following information at the worker node:

 Filesystem   1K-blocks  Used Available Use% Mounted on
 /dev/xvda1 8256920   8256456 0 100% /
 tmpfs  7752012 0   7752012   0% /dev/shm
 /dev/xvdb 30963708   1729652  27661192   6% /mnt

 Does anybody know how to fix this?  Thanks.


 Ey-Chih Chow



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/no-space-left-at-worker-node-tp21545.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Pyspark Hbase scan.

2015-02-05 Thread gen tang
Hi,

In fact, this pull request https://github.com/apache/spark/pull/3920 is for
doing an HBase scan. However, it is not merged yet.
You can also take a look at the example code at
http://spark-packages.org/package/20, which uses scala and python to
read data from hbase.

Hope this can be helpful.

Cheers
Gen



On Thu, Feb 5, 2015 at 11:11 AM, Castberg, René Christian 
rene.castb...@dnvgl.com wrote:

  ​Hi,

 I am trying to do a hbase scan and read it into a spark rdd using pyspark.
 I have successfully written data to hbase from pyspark, and been able to
 read a full table from hbase using the python example code. Unfortunately I
 am unable to find any example code for doing an HBase scan and read it into
 a spark rdd from pyspark.

 I have found a scala example :

 http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark

 But i can't find anything on how to do this from python. Can anybody shed
 some light on how (and if) this can be done?​

  Regards

  Rene Castberg​




 **
 This e-mail and any attachments thereto may contain confidential
 information and/or information protected by intellectual property rights
 for the exclusive attention of the intended addressees named above. If you
 have received this transmission in error, please immediately notify the
 sender by return e-mail and delete this message and its attachments.
 Unauthorized use, copying or further full or partial distribution of this
 e-mail or its contents is prohibited.

 **



Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi,

Using spark under windows is a really bad idea, because even if you solve the
problems with hadoop, you will probably meet the problem of
java.net.SocketException: connection reset by peer. It is caused by the
fact that we ask for socket ports too frequently under windows. To my knowledge,
it is really difficult to solve. And you will find something really funny: the
same code sometimes works and sometimes not, even in shell mode.

And I am sorry, but I don't see the interest of running spark under windows,
and moreover using the local file system, in a business environment. Do you
have a cluster on windows?

FYI, I have used spark prebuilt for hadoop 1 under windows 7 and there is no
problem launching it, but there is the problem of java.net.SocketException. If
you are using spark prebuilt for hadoop 2, you should consider following the
solution provided by https://issues.apache.org/jira/browse/SPARK-2356

Cheers
Gen



On Thu, Jan 29, 2015 at 5:54 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  Install virtual box which run Linux? That does not help us. We have
 business reason to run it on Windows operating system, e.g. Windows 2008 R2.



 If anybody have done that, please give some advise on what version of
 spark, which version of Hadoop do you built spark against, etc…. Note that
 we only use local file system and do not have any hdfs file system at all.
 I don’t understand why spark generate so many error on Hadoop while we
 don’t even need hdfs.



 Ningjun





 *From:* gen tang [mailto:gen.tan...@gmail.com]
 *Sent:* Thursday, January 29, 2015 10:45 AM
 *To:* Wang, Ningjun (LNG-NPV)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Fail to launch spark-shell on windows 2008 R2



 Hi,



 I tried to use spark under windows once. However the only solution that I
 found is to install virtualbox



 Hope this can help you.

 Best

 Gen





 On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:

 I deployed spark-1.1.0 on Windows 7 and was albe to launch the
 spark-shell. I then deploy it to windows 2008 R2 and launch the
 spark-shell, I got the error



 java.lang.RuntimeException: Error while running command to get file
 permissions : java.io.IOExceptio

 n: Cannot run program ls: CreateProcess error=2, The system cannot find
 the file specified

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)

 at org.apache.hadoop.util.Shell.run(Shell.java:182)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil

 eSystem.java:443)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst

 em.java:418)







 Here is the detail output





 C:\spark-1.1.0\bin   spark-shell

 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:14 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53692

 15/01/29 10:13:14 INFO Utils: Successfully started service 'HTTP class
 server' on port 53692.

 Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could
 not initialize class org.f

 usesource.jansi.internal.Kernel32

 Falling back to SimpleReader.

 Welcome to

     __

  / __/__  ___ _/ /__

 _\ \/ _ \/ _ `/ __/  '_/

/___/ .__/\_,_/_/ /_/\_\   version 1.1.0

   /_/



 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.7.0_71)

 Type in expressions to have them evaluated.

 Type :help for more information.

 15/01/29 10:13:18 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:18 INFO Slf4jLogger: Slf4jLogger started

 15/01/29 10:13:18 INFO Remoting: Starting remoting

 15/01/29 10:13:19 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@L

 AB4-WIN01.pcc.lexisnexis.com:53705]

 15/01/29 10:13:19 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp

Re: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread gen tang
Hi,

I tried to use spark under windows once. However, the only solution that I
found was to install VirtualBox.

Hope this can help you.
Best
Gen


On Thu, Jan 29, 2015 at 4:18 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  I deployed spark-1.1.0 on Windows 7 and was able to launch the
 spark-shell. I then deployed it to windows 2008 R2 and launched the
 spark-shell, I got the error



 java.lang.RuntimeException: Error while running command to get file
 permissions : java.io.IOExceptio

 n: Cannot run program ls: CreateProcess error=2, The system cannot find
 the file specified

 at java.lang.ProcessBuilder.start(Unknown Source)

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)

 at org.apache.hadoop.util.Shell.run(Shell.java:182)

 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)

 at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)

 at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFil

 eSystem.java:443)

 at
 org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getPermission(RawLocalFileSyst

 em.java:418)







 Here is the detail output





 C:\spark-1.1.0\bin   spark-shell

 15/01/29 10:13:13 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:13 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:13 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:14 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:14 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53692

 15/01/29 10:13:14 INFO Utils: Successfully started service 'HTTP class
 server' on port 53692.

 Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could
 not initialize class org.f

 usesource.jansi.internal.Kernel32

 Falling back to SimpleReader.

 Welcome to

     __

  / __/__  ___ _/ /__

 _\ \/ _ \/ _ `/ __/  '_/

/___/ .__/\_,_/_/ /_/\_\   version 1.1.0

   /_/



 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.7.0_71)

 Type in expressions to have them evaluated.

 Type :help for more information.

 15/01/29 10:13:18 INFO SecurityManager: Changing view acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: Changing modify acls to:
 ningjun.wang,

 15/01/29 10:13:18 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled;

 users with view permissions: Set(ningjun.wang, ); users with modify
 permissions: Set(ningjun.wang, )



 15/01/29 10:13:18 INFO Slf4jLogger: Slf4jLogger started

 15/01/29 10:13:18 INFO Remoting: Starting remoting

 15/01/29 10:13:19 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@L

 AB4-WIN01.pcc.lexisnexis.com:53705]

 15/01/29 10:13:19 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkDriver@LAB4-WIN

 01.pcc.lexisnexis.com:53705]

 15/01/29 10:13:19 INFO Utils: Successfully started service 'sparkDriver'
 on port 53705.

 15/01/29 10:13:19 INFO SparkEnv: Registering MapOutputTracker

 15/01/29 10:13:19 INFO SparkEnv: Registering BlockManagerMaster

 15/01/29 10:13:19 INFO DiskBlockManager: Created local directory at
 C:\Users\NINGJU~1.WAN\AppData\Lo

 cal\Temp\3\spark-local-20150129101319-f9da

 15/01/29 10:13:19 INFO Utils: Successfully started service 'Connection
 manager for block manager' on

 port 53708.

 15/01/29 10:13:19 INFO ConnectionManager: Bound socket to port 53708 with
 id = ConnectionManagerId(L

 AB4-WIN01.pcc.lexisnexis.com,53708)

 15/01/29 10:13:19 INFO MemoryStore: MemoryStore started with capacity
 265.4 MB

 15/01/29 10:13:19 INFO BlockManagerMaster: Trying to register BlockManager

 15/01/29 10:13:19 INFO BlockManagerMasterActor: Registering block manager
 LAB4-WIN01.pcc.lexisnexis.

 com:53708 with 265.4 MB RAM

 15/01/29 10:13:19 INFO BlockManagerMaster: Registered BlockManager

 15/01/29 10:13:19 INFO HttpFileServer: HTTP File server directory is
 C:\Users\NINGJU~1.WAN\AppData\L

 ocal\Temp\3\spark-2f65b1c3-00e2-489b-967c-4e1f41520583

 15/01/29 10:13:19 INFO HttpServer: Starting HTTP Server

 15/01/29 10:13:19 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:19 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:53709

 15/01/29 10:13:19 INFO Utils: Successfully started service 'HTTP file
 server' on port 53709.

 15/01/29 10:13:20 INFO Server: jetty-8.y.z-SNAPSHOT

 15/01/29 10:13:20 INFO AbstractConnector: Started
 SelectChannelConnector@0.0.0.0:4040

 15/01/29 10:13:20 INFO Utils: Successfully started service 'SparkUI

[documentation] Update the python example ALS of the site?

2015-01-27 Thread gen tang
Hi,

In spark 1.2.0, the ratings are required to be an RDD of Rating, or of
tuples or lists. However, the current example on the site still uses an
RDD[array] as the ratings. Therefore, the example doesn't work under
version 1.2.0.
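
For example, under 1.2.0 the ratings need to be built like this (a minimal sketch with made-up values; sc is the SparkContext from the shell):

from pyspark.mllib.recommendation import ALS, Rating

# each element must be a Rating (or a (user, product, rating) tuple/list)
ratings = sc.parallelize([Rating(1, 10, 3.0), Rating(2, 10, 1.0), Rating(2, 20, 5.0)])
model = ALS.train(ratings, rank=10, iterations=10)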

Maybe we should update the documentation on the site?
Thanks a lot.

Cheers
Gen


Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-17 Thread gen tang
Hi,

This is because 'ssh-ready' in the ec2 script means that all the instances
are in the 'running' state and all the instances have the status OK.
In other words, the instances are ready to download and install
software, just as emr is ready for bootstrap actions.
Before, the script just repeatedly printed information showing that we
are waiting for every instance to be launched. And it was quite ugly, so they
changed the information that is printed.
However, you can use ssh to connect to an instance even if it is in the
'pending' state. If you wait patiently a little more, the script will
finish launching the cluster.

Cheers
Gen


On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com
wrote:

 Originally posted here:
 http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script

 I'm trying to launch a standalone Spark cluster using its pre-packaged EC2
 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

 ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k
 key-pair -i identity-file.pem -r us-west-2 -s 3 launch test
 Setting up security groups...
 Searching for existing cluster test...
 Spark AMI: ami-ae6e0d9e
 Launching instances...
 Launched 3 slaves in us-west-2c, regid = r-b___6
 Launched master in us-west-2c, regid = r-0__0
 Waiting for all instances in cluster to enter 'ssh-ready'
 state..

 Yet I can SSH into these instances without compaint:

 ubuntu@machine:~$ ssh -i identity-file.pem root@master-ip
 Last login: Day MMM DD HH:mm:ss 20YY from
 c-AA-BBB--DDD.eee1.ff.provider.net

__|  __|_  )
_|  ( /   Amazon Linux AMI
   ___|\___|___|

 https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
 There are 59 security update(s) out of 257 total update(s) available
 Run sudo yum update to apply all updates.
 Amazon Linux version 2014.09 is available.
 root@ip-internal ~]$

 I'm trying to figure out if this is a problem in AWS or with the Spark
 scripts. I've never had this issue before until recently.


 --
 Nathan Murthy // 713.884.7110 (mobile) // @natemurthy



Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread gen tang
Hi,

As you said, --executor-cores defines the max number of tasks that
an executor can run simultaneously. So, if you claim 10 cores, it is not
possible to launch more than 10 tasks in an executor at the same time.
In my experience, setting cores to more than the physical CPU cores will
overload the CPU at some point during the execution of the spark application,
especially when you are using algorithms in the mllib package. In addition,
executor-cores affects the default level of parallelism of spark.
Therefore, I recommend you set cores = physical cores by default.
Moreover, I don't think overcommitting cpu will increase the use of the CPU. In
my opinion, it just increases the waiting queue of the CPU.
If you observe that the CPU load is very low (through ganglia, for example) and
there is too much IO, maybe increasing the level of parallelism or serializing
your objects is a good choice.
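
For example, rather than overcommitting cores, I would keep executor cores at the physical count and raise the parallelism instead (the numbers below are only illustrative):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("no-overcommit")
        .set("spark.executor.cores", "24")         # = physical cores per worker
        .set("spark.default.parallelism", "96"))   # more tasks per stage instead
sc = SparkContext(conf=conf)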

Hoping this helps

Cheers
Gen


On Fri, Jan 9, 2015 at 10:12 AM, Xuelin Cao xuelincao2...@gmail.com wrote:


 Thanks, but, how to increase the tasks per core?

 For example, if the application claims 10 cores, is it possible to launch
 100 tasks concurrently?



 On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke jornfra...@gmail.com wrote:

 Hallo,

 Based on experiences with other software in virtualized environments I
 cannot really recommend this. However, I am not sure how Spark reacts. You
 may face unpredictable task failures depending on utilization, tasks
 connecting to external systems (databases etc.) may fail unexpectedly and
 this might be a problem for them (transactions not finishing etc.).

 Why not increase the tasks per core?

 Best regards
 Le 9 janv. 2015 06:46, Xuelin Cao xuelincao2...@gmail.com a écrit :


 Hi,

   I'm wondering whether it is a good idea to overcommit CPU cores on
 the spark cluster.

   For example, in our testing cluster, each worker machine has 24
 physical CPU cores. However, we are allowed to set the CPU core number to
 48 or more in the spark configuration file. As a result, we are allowed to
 launch more tasks than the number of physical CPU cores.

   The motivation of overcommit CPU cores is, for many times, a task
 cannot consume 100% resource of a single CPU core (due to I/O, shuffle,
 etc.).

   So, overcommit the CPU cores allows more tasks running at the same
 time, and makes the resource be used economically.

   But, is there any reason that we should not doing like this?
 Anyone tried this?







Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply.
In fact, I need to work on almost all the data in teradata (~100T). So, I
don't think that JdbcRDD is a good choice.

Cheers
Gen


On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote:

 Depending on your use cases. If the use case is to extract small amount of
 data out of teradata, then you can use the JdbcRDD and soon a jdbc input
 source based on the new Spark SQL external data source API.



 On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I have a stupid question:
 Is it possible to use spark on Teradata data warehouse, please? I read
 some news on internet which say yes. However, I didn't find any example
 about this issue

 Thanks in advance.

 Cheers
 Gen
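
For reference, once that JDBC input source lands, reading a Teradata table
from Python should look roughly like the sketch below. This is a guess at a
future API rather than something available today, the URL and table name
are hypothetical, and the Teradata JDBC driver jar would have to be on the
classpath:

{code}
# hedged sketch of the planned JDBC data source; nothing here is final
df = sqlContext.read.format("jdbc").options(
    url="jdbc:teradata://td-host/DATABASE=mydb",  # hypothetical Teradata JDBC URL
    dbtable="mytable",
).load()
df.registerTempTable("mytable")
{code}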





Spark on teradata?

2015-01-07 Thread gen tang
Hi,

I have a stupid question:
Is it possible to use Spark on a Teradata data warehouse? I read some
articles on the internet that say yes; however, I couldn't find any
examples of this.

Thanks in advance.

Cheers
Gen


Re: Spark Trainings/ Professional certifications

2015-01-07 Thread gen tang
Hi,

I am sorry to bother you, but I couldn't find any information about the
online Spark certification test managed through Kryterion.
Could you please send me the link for it?
Thanks a lot in advance.

Cheers
Gen


On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote:

 Hi Saurabh,

 In your area, Big Data Partnership provides Spark training:
 http://www.bigdatapartnership.com/

 As Sean mentioned, there is a certification program via a partnership
 between O'Reilly Media and Databricks http://www.oreilly.com/go/sparkcert
  That is offered in two ways, in-person at events such as Strata + Hadoop
 World http://strataconf.com/  and also an online test managed through
 Kryterion. Sign up on the O'Reilly page.

 There are also two MOOCs starting soon on edX through University of
 California:

 Intro to Big Data with Apache Spark
 by Prof. Anthony Joseph, UC Berkeley
 begins Feb 23

 https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x#.VK1pkmTF8gk

 Scalable Machine Learning
 Prof. Ameet Talwalkar, UCLA
 begins Apr 14

 https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#.VK1pkmTF8gk

 For coaching, arguably you might be best to talk with consultants
 especially for near-term needs. Contact me off-list and I can help provide
 intros in your area.

 Thanks,
 Paco


 On Wed, Jan 7, 2015 at 6:38 AM, Saurabh Agrawal 
 saurabh.agra...@markit.com wrote:


 Hi,



 Can you please suggest some of the best available trainings/ coaching
  and professional certifications in Apache Spark?



 We are trying to run predictive analysis on our Sales data and come out
 with recommendations (leads). We have tried to run CF but we end up getting
 absolutely bogus results!! A training that would leave us hands on to do
 our job effectively is what we are after. In addition to this, if this
 training could provide a firm ground for a professional certification, that
 would be an added advantage.



 Thanks for your inputs



 Regards,

 Saurabh Agrawal

 --
 This e-mail, including accompanying communications and attachments, is
 strictly confidential and only for the intended recipient. Any retention,
 use or disclosure not expressly authorised by Markit is prohibited. This
 email is subject to all waivers and other terms at the following link:
 http://www.markit.com/en/about/legal/email-disclaimer.page

 Please visit http://www.markit.com/en/about/contact/contact-us.page? for
 contact information on our offices worldwide.

 MarkitSERV Limited has its registered office located at Level 4,
 Ropemaker Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized
 and regulated by the Financial Conduct Authority with registration number
 207294





Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
Hi,

The ec2 launch script provided by Spark uses
https://github.com/mesos/spark-ec2 to download and configure all the tools
in the cluster (Spark, Hadoop, etc.), so you can create your own git
repository to achieve your goal. More precisely:

1. Upload your own version of Spark to S3 and note the path to your Spark
tarball.
2. Fork https://github.com/mesos/spark-ec2 and change ./spark/init.sh so
that it wgets the path to your Spark tarball.
3. Change line 638 in the ec2 launch script so that it git clones your
repository on GitHub instead.

Hope this can be helpful.

Cheers
Gen


On Tue, Jan 6, 2015 at 11:51 PM, Ganon Pierce ganon.pie...@me.com wrote:

 Is there a way to use the ec2 launch script with a locally built version
 of spark? I launch and destroy clusters pretty frequently and would like to
 not have to wait each time for the master instance to compile the source as
 happens when I set the -v tag with the latest git commit. To be clear, I
 would like to launch a non-release version of spark compiled locally as
 quickly as I can launch a release version (e.g. -v 1.2.0) which does not
 have to be compiled upon launch.

 Up to this point, I have just used the launch script included with the
 latest release to set up the cluster and then manually replaced the
 assembly file on the master and slaves with the version I built locally and
 then stored on s3. Is there anything wrong with doing it this way? Further,
 is there a better or more standard way of accomplishing this?
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-16 Thread Gen
Hi,

How many clients and how many products do you have?

Cheers
Gen
jaykatukuri wrote
 Hi all,

 I am running into an out of memory error while running ALS using MLLIB on
 a reasonably small data set consisting of around 6 Million ratings. The
 stack trace is below:

 java.lang.OutOfMemoryError: Java heap space at org.jblas.DoubleMatrix.

 (DoubleMatrix.java:323)   at
 org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)   at
 org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)   at
 org.apache.spark.mllib.recommendation.ALS$$anonfun$21.apply(ALS.scala:465)
 at
 org.apache.spark.mllib.recommendation.ALS$$anonfun$21.apply(ALS.scala:465)
 at scala.Array$.fill(Array.scala:267) at
 org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:465)
 at
 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:445)
 at
 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:444)
 at
 org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
 at
 org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)   at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)  at
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)   at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)  at
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)  at
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)   at
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)  at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51) at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)I am
 using 2GB for executors memory.  I tried with 100 executors.Can some one
 please point me in the right direction ?Thanks,Jay





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-java-lang-OutOfMemoryError-Java-heap-space-tp20584p20714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Why so many tasks?

2014-12-16 Thread Gen
Hi,

As you have 1,000 files, the RDD created by textFile will have 1,000
partitions. That is normal. In fact, following the same principle as HDFS,
it is better to store the data in a smaller number of larger files.

You can use data.coalesce(10) to solve this problem (it reduces the number
of partitions).
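
A minimal sketch against the example in your message (the partition counts
are what I would expect, not something I re-ran):

{code}
data = sc.textFile("/user/foo/myfiles/*")
print(data.getNumPartitions())   # 1000 -- one partition (and one task) per small file
fewer = data.coalesce(10)        # merge partitions without a full shuffle
print(fewer.count())             # same result, now computed with ~10 tasks
{code}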

Cheers
Gen



bethesda wrote
 Our job is creating what appears to be an inordinate number of very small
 tasks, which blow out our os inode and file limits.  Rather than
 continually upping those limits, we are seeking to understand whether our
 real problem is that too many tasks are running, perhaps because we are
 mis-configured or we are coding incorrectly.
 
 Rather than posting our actual code I have re-created the essence of the
 matter in the shell with a directory of files simulating the data we deal
 with.  We have three servers, each with 8G RAM.
 
 Given 1,000 files, each containing a string of 100 characters, in the
 myfiles directory:
 
 val data = sc.textFile(/user/foo/myfiles/*)
 
 val c = data.count
 
 The count operation produces 1,000 tasks.  Is this normal?
 
 val cart = data.cartesian(data)
 cart.count
 
 The cartesian operation produces 1M tasks.  I understand that the
 cartesian product of 1,000 items against itself is 1M, however, it seems
 the overhead of all this task creation and file I/O of all these tiny
 files outweighs the gains of distributed computing.  What am I missing
 here?
 
 Below is the truncated output of the count operation, if this helps
 indicate a configuration problem.
 
 Thank you.
 
 scala data.count
 14/12/16 07:40:46 INFO FileInputFormat: Total input paths to process :
 1000
 14/12/16 07:40:47 INFO SparkContext: Starting job: count at 
 console
 :15
 14/12/16 07:40:47 INFO DAGScheduler: Got job 0 (count at 
 console
 :15) with 1000 output partitions (allowLocal=false)
 14/12/16 07:40:47 INFO DAGScheduler: Final stage: Stage 0(count at 
 console
 :15)
 14/12/16 07:40:47 INFO DAGScheduler: Parents of final stage: List()
 14/12/16 07:40:47 INFO DAGScheduler: Missing parents: List()
 14/12/16 07:40:47 INFO DAGScheduler: Submitting Stage 0
 (/user/ds/randomfiles/* MappedRDD[3] at textFile at 
 console
 :12), which has no missing parents
 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(2400) called with
 curMem=507154, maxMem=278019440
 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2 stored as values in
 memory (estimated size 2.3 KB, free 264.7 MB)
 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(1813) called with
 curMem=509554, maxMem=278019440
 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as
 bytes in memory (estimated size 1813.0 B, free 264.7 MB)
 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in
 memory on dev-myserver-1.abc.cloud:44041 (size: 1813.0 B, free: 265.1 MB)
 14/12/16 07:40:47 INFO BlockManagerMaster: Updated info of block
 broadcast_2_piece0
 14/12/16 07:40:47 INFO DAGScheduler: Submitting 1000 missing tasks from
 Stage 0 (/user/ds/randomfiles/* MappedRDD[3] at textFile at 
 console
 :12)
 14/12/16 07:40:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000
 tasks
 14/12/16 07:40:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
 0, dev-myserver-2.abc.cloud, NODE_LOCAL, 1202 bytes)
 14/12/16 07:40:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
 1, dev-myserver-3.abc.cloud, NODE_LOCAL, 1201 bytes)
 14/12/16 07:40:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID
 2, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
 14/12/16 07:40:47 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID
 3, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
 14/12/16 07:40:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
 4, dev-myserver-3.abc.cloud, NODE_LOCAL, 1204 bytes)
 14/12/16 07:40:47 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID
 5, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from
 [dev-myserver-3.abc.cloud/10.40.13.192:36133]
 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from
 [dev-myserver-2.abc.cloud/10.40.13.195:35716]
 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from
 [dev-myserver-1.abc.cloud/10.40.13.194:33728]
 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to
 [dev-myserver-1.abc.cloud/10.40.13.194:49458]
 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to
 [dev-myserver-3.abc.cloud/10.40.13.192:58579]
 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to
 [dev-myserver-2.abc.cloud/10.40.13.195:52502]
 14/12/16 07:40:47 INFO SendingConnection: Connected to
 [dev-myserver-3.abc.cloud/10.40.13.192:58579], 1 messages pending
 14/12/16 07:40:47 INFO SendingConnection: Connected to
 [dev-myserver-1.abc.cloud/10.40.13.194:49458], 1 messages pending
 14/12/16 07:40:47 INFO SendingConnection: Connected to
 [dev-myserver-2.abc.cloud/10.40.13.195:52502], 1 messages pending
 14/12/16 07

Cannot pickle DecisionTreeModel in the pyspark

2014-12-12 Thread Gen
Hi everyone,

I am trying to save a decision tree model in Python, and I use
pickle.dump() to do this. However, it returns the following error:

/cPickle.UnpickleableError: Cannot pickle type 'thread.lock' objects/

I did some tests on the other models. It seems that the decision tree model
is the only model in pyspark that we cannot pickle.

FYI: I use spark 1.1.1

Do you have any idea how to solve this problem? (I don't know whether using
Scala would solve it or not.)
Thanks a lot in advance for your help.

Cheers
Gen



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-pickle-DecisionTreeModel-in-the-pyspark-tp20661.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does filter on an RDD scan every data item ?

2014-12-02 Thread Gen
Hi,

For your first question, I think that we can use
/sc.parallelize(rdd.take(1000))/

For your second question, I am not sure, but I don't think that we can
restrict filter to certain partitions without scanning every element.
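
A small self-contained sketch of both points (the data is made up):

{code}
# stands in for the 10-million-row pair RDD from the question
rdd = sc.parallelize([("k%d" % (i % 5), i) for i in range(10000)])

# 1) take(1000) pulls 1,000 elements to the driver; parallelize turns them into a new RDD
subset = sc.parallelize(rdd.take(1000))

# 2) filter() is lazy, but the predicate still runs on every element that is scanned;
#    without a custom partitioner, Spark cannot skip partitions by key here
matches = rdd.filter(lambda kv: kv[0] == "k3")
print(subset.count(), matches.count())
{code}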

Cheers
Gen


nsareen wrote
 Hi ,
 
 I wanted some clarity into the functioning of Filter function of RDD.
 
 1) Does filter function scan every element saved in RDD? if my RDD
 represents 10 Million rows, and if i want to work on only 1000 of them, is
 there an efficient way of filtering the subset without having to scan
 every element ?
 
 2) If my RDD represents a Key / Value data set. When i filter this data
 set of 10 Million rows, can i specify that the search should be restricted
 to only partitions which contain specific keys ? Will spark run by filter
 operation on all partitions if the partitions are done by key,
 irrespective the key exists in a partition or not ?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: --executor-cores cannot change vcores in yarn?

2014-11-03 Thread Gen
Hi,

Well, I didn't find the original documentation, but according to
http://qnalist.com/questions/2791828/about-the-cpu-cores-and-cpu-usage ,
vcores are not physical CPU cores but virtual cores.
I also used the top command to monitor CPU utilization during the Spark
task: Spark can use all the CPUs even if I leave --executor-cores at its
default (1).

Hope that it can be a help.
Cheers
Gen


Gen wrote
 Hi,
 
 Maybe it is a stupid question, but I am running spark on yarn. I request
 the resources by the following command:
 {code}
 ./spark-submit --master yarn-client --num-executors #number of worker
 --executor-cores #number of cores. ...
 {code}
 However, after launching the task, I use 
/
 yarn node -status ID 
/
  to monitor the situation of cluster. It shows that the number of Vcores
 used for each container is always 1 no matter what number I pass by
 --executor-cores. 
 Any ideas how to solve this problem? Thanks a lot in advance for your
 help.
 
 Cheers
 Gen





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/executor-cores-cannot-change-vcores-in-yarn-tp17883p17992.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



--executor-cores cannot change vcores in yarn?

2014-11-01 Thread Gen
Hi,

Maybe it is a stupid question, but I am running spark on yarn. I request the
resources by the following command:
{code}
./spark-submit --master yarn-client --num-executors #number of worker
--executor-cores #number of cores. ...
{code}
However, after launching the task, I use /yarn node -status ID / to monitor
the situation of cluster. It shows that the number of Vcores used for each
container is always 1 no matter what number I pass by --executor-cores. 
Any ideas how to solve this problem? Thanks a lot in advance for your help.

Cheers
Gen




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/executor-cores-cannot-change-vcores-in-yarn-tp17883.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Executor and BlockManager memory size

2014-10-31 Thread Gen
Hi, 

I meet the same problem in the context of Spark on YARN.
When I open pyspark with the following command:

  spark/bin/pyspark --master yarn-client --num-executors 1 --executor-memory
2500m

it prints *INFO storage.BlockManagerMasterActor: Registering block
manager ip-10-0-6-171.us-west-2.compute.internal:38770 with 1294.1 MB RAM*.
So, going by the documentation, only about 2156.83m is allocated to the
executor. Moreover, according to YARN, 3072m of memory is used for this
container.
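
My rough reading of the defaults is the sketch below, but every constant in
it is an assumption on my part (JVM maxMemory a little below -Xmx, the
1.1-era storage fractions, 384m YARN overhead and 1024m allocation
granularity), so I am not sure it is right:

{code}
executor_mem_mb   = 2500
jvm_max_memory_mb = 2396                            # Runtime.maxMemory() is typically a bit below -Xmx
block_manager_mb  = jvm_max_memory_mb * 0.6 * 0.9   # storage.memoryFraction * safetyFraction
print(block_manager_mb)                             # ~1294 MB, matching the BlockManager log line

yarn_overhead_mb  = 384                             # spark.yarn.executor.memoryOverhead default
requested_mb      = executor_mem_mb + yarn_overhead_mb   # 2884
print(((requested_mb + 1023) // 1024) * 1024)       # 3072 MB, the container size YARN reports
{code}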

Do you have any ideas about this?
Thanks a lot 

Cheers
Gen


Boromir Widas wrote
 Hey Larry,
 
 I have been trying to figure this out for standalone clusters as well.
 http://apache-spark-user-list.1001560.n3.nabble.com/What-is-a-Block-Manager-td12833.html
 has an answer as to what block manager is for.
 
 From the documentation, what I understood was if you assign X GB to each
 executor, spark.storage.memoryFraction(default 0.6) * X is assigned to the
 BlockManager and the rest for the JVM itself(?).
 However, as you see, 26.8G is assigned to the BM, and assuming 0.6
 memoryFraction, this means the executor sees ~44.7G of memory, I am not
 sure what happens to the difference(5.3G).
 
 
 On Thu, Oct 9, 2014 at 9:40 PM, Larry Xiao lt;

 xiaodi@.edu

 gt; wrote:
 
 Hi all,

 I'm confused about Executor and BlockManager, why they have different
 memory.

 14/10/10 08:50:02 INFO AppClient$ClientActor: Executor added:
 app-20141010085001-/2 on worker-20141010004933-brick6-35657
 (brick6:35657) with 6 cores
 14/10/10 08:50:02 INFO SparkDeploySchedulerBackend: Granted executor ID
 app-20141010085001-/2 on hostPort brick6:35657 with 6 cores, 50.0 GB
 RAM

 14/10/10 08:50:07 INFO BlockManagerMasterActor: Registering block
 manager
 brick6:53296 with 26.8 GB RAM

 and on the WebUI,

 Executor ID  Address       RDD Blocks  Memory Used       Disk Used  Active Tasks  Failed Tasks  Complete Tasks  Total Tasks  Task Time  Input  Shuffle Read  Shuffle Write
 0            brick3:37607  0           0.0 B / 26.8 GB   0.0 B      6             0             0               6            0 ms       0.0 B  0.0 B         0.0 B
 1            brick1:59493  0           0.0 B / 26.8 GB   0.0 B      6             0             0               6            0 ms       0.0 B  0.0 B         0.0 B
 2            brick6:53296  0           0.0 B / 26.8 GB   0.0 B      6             0             0               6            0 ms       0.0 B  0.0 B         0.0 B
 3            brick5:38543  0           0.0 B / 26.8 GB   0.0 B      6             0             0               6            0 ms       0.0 B  0.0 B         0.0 B
 4            brick2:44937  0           0.0 B / 26.8 GB   0.0 B      6             0             0               6            0 ms       0.0 B  0.0 B         0.0 B
 5            brick4:46798  0           0.0 B / 26.8 GB   0.0 B      6             0             0               6            0 ms       0.0 B  0.0 B         0.0 B
 driver       brick0:57692  0           0.0 B / 274.6 MB  0.0 B      0             0             0               0            0 ms       0.0 B  0.0 B         0.0 B

 As I understand it, a worker consist of a daemon and an executor, and
 executor takes charge both execution and storage.
 So does it mean that 26.8 GB is saved for storage and the rest is for
 execution?

 Another question is that, throughout execution, it seems that the
 blockmanager is always almost free.

 14/10/05 14:33:44 INFO BlockManagerInfo: Added broadcast_21_piece0 in
 memory on brick2:57501 (size: 1669.0 B, free: 21.2 GB)

 I don't know what I'm missing here.

 Best regards,
 Larry

 -
 To unsubscribe, e-mail: 

 user-unsubscribe@.apache

 For additional commands, e-mail: 

 user-help@.apache








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executor-and-BlockManager-memory-size-tp16091p17816.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Gen
Hi,

I think the problem is caused by data serialization. You can follow the
tuning guide at https://spark.apache.org/docs/latest/tuning.html to
register your class testing.

Also, for pyspark 1.1.0 there is a known problem with the default
serializer: https://issues.apache.org/jira/browse/SPARK-2652?filter=-2 .
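
The registration that the tuning guide describes happens on the JVM side;
from Python you can at least point the configuration at Kryo. A sketch (the
registrator class below is hypothetical -- it would have to be a small
Scala/Java class on the classpath that registers your types):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "com.example.MyRegistrator"))  # hypothetical registrator
sc = SparkContext(conf=conf)
{code}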

Cheers.
Gen

 
sid wrote
 Hi, I am new to Spark and I am trying to use pyspark.
 
 I am trying to find the mean of 128-dimensional vectors present in a file.
 
 Below is the code:
 
  from cStringIO import StringIO

  class testing:
      def __str__(self):
          file_str = StringIO()
          for n in self.vector:
              file_str.write(str(n))
              file_str.write(" ")
          return file_str.getvalue()

      def __init__(self, txt="", initial=False):
          self.vector = [0.0] * 128
          if len(txt) == 0:
              return
          i = 0
          for n in txt.split():
              if i < 128:
                  self.vector[i] = float(n)
                  i = i + 1
                  continue
              self.filename = n
              break

      def addVec(self, r):
          a = testing()
          for n in xrange(0, 128):
              a.vector[n] = self.vector[n] + r.vector[n]
          return a

  def InitializeAndReturnPair(string, first=False):
      vec = testing(string, first)
      return 1, vec


  from pyspark import SparkConf, SparkContext
  conf = (SparkConf()
          .setMaster("local")
          .setAppName("My app")
          .set("spark.executor.memory", "1g"))
  sc = SparkContext(conf=conf)

  inp = sc.textFile("input.txt")
  output = inp.map(lambda s: InitializeAndReturnPair(s, True)).cache()
  output.saveAsTextFile("output")
  print output.reduceByKey(lambda a, b: a).collect()
 -
 
 Example line in input.txt 
 6.0 156.0 26.0 3.0 1.0 0.0 2.0 1.0 15.0 113.0 53.0 139.0 156.0 0.0 0.0 0.0
 156.0 29.0 1.0 38.0 59.0 0.0 0.0 0.0 28.0 4.0 2.0 9.0 1.0 0.0 0.0 0.0 9.0
 83.0 13.0 1.0 0.0 9.0 42.0 7.0 41.0 71.0 74.0 123.0 35.0 17.0 7.0 2.0
 156.0 27.0 6.0 33.0 11.0 2.0 0.0 11.0 35.0 4.0 2.0 4.0 1.0 3.0 2.0 4.0 0.0
 0.0 0.0 0.0 2.0 19.0 45.0 17.0 47.0 2.0 2.0 7.0 59.0 90.0 15.0 11.0 156.0
 14.0 1.0 4.0 9.0 11.0 2.0 29.0 35.0 6.0 5.0 9.0 4.0 2.0 1.0 3.0 1.0 0.0
 0.0 0.0 1.0 5.0 25.0 14.0 27.0 2.0 0.0 2.0 86.0 48.0 10.0 6.0 156.0 23.0
 1.0 2.0 21.0 6.0 0.0 3.0 31.0 10.0 4.0 3.0 0.0 0.0 1.0 2.0
 
  I am not able to figure out what I am missing. I tried changing the
  serializer but am still getting a similar error.
  
  I have put the error here: http://pastebin.com/0tqiiJQm





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-code-crashing-on-ReduceByKey-if-I-return-custom-class-object-tp17317p17345.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to aggregate data in Apach Spark

2014-10-20 Thread Gen
Hi,

I will write the code in Python:

{code:title=test.py}
from operator import add  # needed for reduceByKey(add)

data = sc.textFile(...).map(...)  ## Please make sure that the rdd is
                                  ## like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
keypair = keypair.reduceByKey(add)
out = keypair.map(lambda l: list(l[0]) + [l[1]])
{code}
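
A quick check on a few of your sample rows (the output is written by hand,
and the order may differ):

{code}
from operator import add

sample = sc.parallelize([["A1", "c1", "c2", "2"],
                         ["A1", "c1", "c2", "1"],
                         ["A1", "c11", "c2", "1"]])
pairs = sample.map(lambda l: ((l[0], l[1], l[2]), float(l[3]))).reduceByKey(add)
print(pairs.map(lambda kv: list(kv[0]) + [kv[1]]).collect())
# [['A1', 'c1', 'c2', 3.0], ['A1', 'c11', 'c2', 1.0]]
{code}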


Kalyan wrote
 I have a distribute system on 3 nodes and my dataset is distributed among
 those nodes. for example, I have a test.csv file which is exist on all 3
 nodes and it contains 4 columns of
 
 **row | id,  C1, C2,  C3
 --
 row1  | A1 , c1 , c2 ,2
 row2  | A1 , c1 , c2 ,1 
 row3  | A1 , c11, c2 ,1 
 row4  | A2 , c1 , c2 ,1 
 row5  | A2 , c1 , c2 ,1 
 row6  | A2 , c11, c2 ,1 
 row7  | A2 , c11, c21,1 
 row8  | A3 , c1 , c2 ,1
 row9  | A3 , c1 , c2 ,2
 row10 | A4 , c1 , c2 ,1
 
 I need help, how to aggregate data set by id, c1,c2,c3 columns and output
 like this
 
 **row | id,  C1, C2,  C3
 --
 row1  | A1 , c1 , c2 ,3
 row2  | A1 , c11, c2 ,1 
 row3  | A2 , c1 , c2 ,2 
 row4  | A2 , c11, c2 ,1 
 row5  | A2 , c11, c21,1 
 row6  | A3 , c1 , c2 ,3
 row7  | A4 , c1 , c2 ,1
 
 Thanks 
 Kalyan





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-aggregate-data-in-Apach-Spark-tp16764p16803.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ALS implicit error pyspark

2014-10-20 Thread Gen
Hi, everyone, 

According to Xiangrui Meng (I think he is the author of ALS), this
problem is caused by Kryo serialization:

/In PySpark 1.1, we switched to Kryo serialization by default. However, ALS
code requires special registration with Kryo in order to work. The error
happens when there is not enough memory and ALS needs to store ratings or
in/out blocks to disk. I will work on this issue./

And he provided a workaround for this problem:

{code}
bin/pyspark --master local-cluster[2,1,512] --conf
'spark.kryo.registrator=org.apache.spark.examples.mllib.MovieLensALS$ALSRegistrator'
--jars lib/spark-examples-1.1.0-hadoop2.4.0.jar
{code}
Note that you should replace the master and the filename of the examples
jar to match your setup.
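
The same workaround can also be expressed as configuration properties
instead of command-line flags; a sketch (again, adjust the master and the
examples-jar path to your installation):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local-cluster[2,1,512]")
        .set("spark.kryo.registrator",
             "org.apache.spark.examples.mllib.MovieLensALS$ALSRegistrator")
        .set("spark.jars", "lib/spark-examples-1.1.0-hadoop2.4.0.jar"))
sc = SparkContext(conf=conf)
{code}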

I created a topic in JIRA: https://issues.apache.org/jira/browse/SPARK-3990
You can go there for more information or make a contribution to fix this
problem.

Cheers
Gen




Gen wrote
 Hi,
 
 I am trying to use ALS.trainImplicit method in the
 pyspark.mllib.recommendation. However it didn't work. So I tried use the
 example in the python API documentation such as:
/
 r1 = (1, 1, 1.0) 
 r2 = (1, 2, 2.0) 
 r3 = (2, 1, 2.0) 
 ratings = sc.parallelize([r1, r2, r3]) 
 model = ALS.trainImplicit(ratings, 1) 
/
 
 It didn't work neither. After searching in google, I found that there are
 only two overloads for ALS.trainImplicit in the scala script. So I tried 
/
 model = ALS.trainImplicit(ratings, 1, 1)
/
 , it worked. But if I set the iterations other than 1,  
/
 model = ALS.trainImplicit(ratings, 1, 2)
/
  or 
/
 model = ALS.trainImplicit(ratings, 4, 2)
/
  for example, it generated error. The information is as follows:
 
 count at ALS.scala:314
 
 Job aborted due to stage failure: Task 6 in stage 189.0 failed 4 times,
 most recent failure: Lost task 6.3 in stage 189.0 (TID 626,
 ip-172-31-35-239.ec2.internal): com.esotericsoftware.kryo.KryoException:
 java.lang.ArrayStoreException: scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)

 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)

 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, 

Thanks a lot for your reply. It is true that the Python API has default
values for every parameter except rank (the default number of iterations is
5). At the very beginning, in order to estimate the speed of
ALS.trainImplicit, I used ALS.trainImplicit(ratings, rank, 1) and it worked.
So I tried ALS with more iterations, for example
ALS.trainImplicit(ratings, rank, 10), and it didn't work.

After several tests, I found that only iterations = 1 works for pyspark,
while in Scala every value works.

Best 
Gen


Davies Liu-2 wrote
 On Thu, Oct 16, 2014 at 9:53 AM, Gen lt;

 gen.tang86@

 gt; wrote:
 Hi,

 I am trying to use ALS.trainImplicit method in the
 pyspark.mllib.recommendation. However it didn't work. So I tried use the
 example in the python API documentation such as:

 /r1 = (1, 1, 1.0)
 r2 = (1, 2, 2.0)
 r3 = (2, 1, 2.0)
 ratings = sc.parallelize([r1, r2, r3])
 model = ALS.trainImplicit(ratings, 1) /

 It didn't work neither. After searching in google, I found that there are
 only two overloads for ALS.trainImplicit in the scala script. So I tried
 /model = ALS.trainImplicit(ratings, 1, 1)/, it worked. But if I set the
 iterations other than 1,  /model = ALS.trainImplicit(ratings, 1, 2)/ or
 /model = ALS.trainImplicit(ratings, 4, 2)/ for example, it generated
 error.
 The information is as follows:
 
 The Python API has default values for all other arguments, so you should
 call with only rank=1 (no default iterations in Scala).
 
 I'm curious that how can you meet this problem?
 
 count at ALS.scala:314

 Job aborted due to stage failure: Task 6 in stage 189.0 failed 4 times,
 most
 recent failure: Lost task 6.3 in stage 189.0 (TID 626,
 ip-172-31-35-239.ec2.internal): com.esotericsoftware.kryo.KryoException:
 java.lang.ArrayStoreException: scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)

 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)

 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)

 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:

 It is really strange, because count at ALS.scala:314 is already out the
 loop
 of iterations. Any idea?
 Thanks a lot for advance.

 FYI: I used spark 1.1.0 and ALS.train() works pretty well for all the
 cases.



 --
 View this message in context:
 http://apache-spark

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, 

Today, I tried again with the following code, but it didn't work...
Could you please tell me your running environment?

/from pyspark.mllib.recommendation import ALS
from pyspark import SparkContext

sc = SparkContext() 
r1 = (1, 1, 1.0) 
r2 = (1, 2, 2.0) 
r3 = (2, 1, 2.0) 
ratings = sc.parallelize([r1, r2, r3]) 
model = ALS.trainImplicit(ratings, 1) /

I used spark-ec2 to create a cluster with 5 slaves (I made some
modifications to spark_ec2.py, but the cluster launches and is configured
correctly).
I found that the task fails when one slave node tries to take the second
task of count at ALS.scala:314. I will take a look at the logs and try to
find the problem.

Best
Gen


Davies Liu-2 wrote
 I can run the following code against Spark 1.1
 
 sc = SparkContext()
 r1 = (1, 1, 1.0)
 r2 = (1, 2, 2.0)
 r3 = (2, 1, 2.0)
 ratings = sc.parallelize([r1, r2, r3])
 model = ALS.trainImplicit(ratings, 1)
 
 Davies
 
 On Thu, Oct 16, 2014 at 2:45 PM, Davies Liu lt;

 davies@

 gt; wrote:
 Could you post the code that have problem with pyspark? thanks!

 Davies

 On Thu, Oct 16, 2014 at 12:27 PM, Gen lt;

 gen.tang86@

 gt; wrote:
 I tried the same data with scala. It works pretty well.
 It seems that it is the problem of pyspark.
 In the console, it shows the following logs:

 Traceback (most recent call last):
   File 
 stdin
 , line 1, in 
 module
 *  File /root/spark/python/pyspark/mllib/recommendation.py, line 76,
 in
 trainImplicit
 14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 4.3 in stage
 975.0 (TID 1653, ip-172-31-35-240.ec2.internal): TaskKilled (killed
 intentionally)
 ratingBytes._jrdd, rank, iterations, lambda_, blocks, alpha)*
   File
 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
 line
 300, in get_return_value
 py4j.protocol.Py4JJavaError14/10/16 19:22:44 WARN
 scheduler.TaskSetManager:
 Lost task 8.2 in stage 975.0 (TID 1650, ip-172-31-35-241.ec2.internal):
 TaskKilled (killed intentionally)
 : An error occurred while calling o32.trainImplicitALSModel.
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Task 6
 in stage 975.0 failed 4 times, most recent failure: Lost task 6.3 in
 stage
 975.0 (TID 1651, ip-172-31-35-237.ec2.internal):
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException:
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)

 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)

 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)

 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)

 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)

 org.apache.spark.scheduler.ResultTask.runTask

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi,

I created an issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-3990
I uploaded the error information there. Thanks in advance for your help.

Best
Gen


Davies Liu-2 wrote
 It seems a bug, Could you create a JIRA for it? thanks!
 
 Davies
 
 On Thu, Oct 16, 2014 at 12:27 PM, Gen lt;

 gen.tang86@

 gt; wrote:
 I tried the same data with scala. It works pretty well.
 It seems that it is the problem of pyspark.
 In the console, it shows the following logs:

 Traceback (most recent call last):
   File 
 stdin
 , line 1, in 
 module
 *  File /root/spark/python/pyspark/mllib/recommendation.py, line 76, in
 trainImplicit
 14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 4.3 in stage
 975.0 (TID 1653, ip-172-31-35-240.ec2.internal): TaskKilled (killed
 intentionally)
 ratingBytes._jrdd, rank, iterations, lambda_, blocks, alpha)*
   File
 /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
   File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
 line
 300, in get_return_value
 py4j.protocol.Py4JJavaError14/10/16 19:22:44 WARN
 scheduler.TaskSetManager:
 Lost task 8.2 in stage 975.0 (TID 1650, ip-172-31-35-241.ec2.internal):
 TaskKilled (killed intentionally)
 : An error occurred while calling o32.trainImplicitALSModel.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task
 6
 in stage 975.0 failed 4 times, most recent failure: Lost task 6.3 in
 stage
 975.0 (TID 1651, ip-172-31-35-237.ec2.internal):
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException:
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)

 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)

 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)

 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173

ALS implicit error pyspark

2014-10-16 Thread Gen
Hi,

I am trying to use ALS.trainImplicit method in the
pyspark.mllib.recommendation. However it didn't work. So I tried use the
example in the python API documentation such as:

/r1 = (1, 1, 1.0) 
r2 = (1, 2, 2.0) 
r3 = (2, 1, 2.0) 
ratings = sc.parallelize([r1, r2, r3]) 
model = ALS.trainImplicit(ratings, 1) /

It didn't work neither. After searching in google, I found that there are
only two overloads for ALS.trainImplicit in the scala script. So I tried
/model = ALS.trainImplicit(ratings, 1, 1)/, it worked. But if I set the
iterations other than 1,  /model = ALS.trainImplicit(ratings, 1, 2)/ or
/model = ALS.trainImplicit(ratings, 4, 2)/ for example, it generated error.
The information is as follows:

count at ALS.scala:314

Job aborted due to stage failure: Task 6 in stage 189.0 failed 4 times, most
recent failure: Lost task 6.3 in stage 189.0 (TID 626,
ip-172-31-35-239.ec2.internal): com.esotericsoftware.kryo.KryoException:
java.lang.ArrayStoreException: scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
   
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
   
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
   
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
   
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
   
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:

It is really strange, because count at ALS.scala:314 is already out the loop
of iterations. Any idea?
Thanks a lot for advance.

FYI: I used spark 1.1.0 and ALS.train() works pretty well for all the cases.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-implicit-error-pyspark-tp16595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-16 Thread Gen
Hi,

You just need to add list() inside the sorted() call.
For example:

map((lambda (x, y): (x, (list(y[0]), list(y[1])))),
    sorted(list(rdd1.cogroup(rdd2).collect())))

I think you just forgot the list...

PS: your post has NOT been accepted by the mailing list yet.

Best 
Gen


pm wrote
 Hi ,
 
 Thanks for reply ,
 
 
 now after doing cogroup mentioned in below,
 
 merge_rdd = map((lambda (x,y): (x, (list(y[0]), list(y[1],
 sorted((rdd1.cogroup(rdd2).collect(
 
 map((lambda (x,y): (x, (list(y[0]), list(y[1],
 sorted((merge_rdd.cogroup(rdd3).collect(
 
 
 i m getting output like  
 
 
 [((u'abc', u'0010'),
   ([(
 pyspark.resultiterable.ResultIterable at 0x4b1b4d0
 ,
  
 pyspark.resultiterable.ResultIterable at 0x4b1b550
 )],
[[(u'address, u'2017 CAN'),
  (u'address_city', u'VESTAVIA '),
 ]])),
  ((u'abc', u'0020'),
   ([(
 pyspark.resultiterable.ResultIterable at 0x4b1bd50
 ,
  
 pyspark.resultiterable.ResultIterable at 0x4b1bf10
 )],
[[(u'address', u'2017 CAN'),
  (u'address_city', u'VESTAV'),
 ]]))]
 
 How to show value for object pyspark.resultiterable.ResultIterable at
 0x4b1b4d0.
 
 I want to show data for pyspark.resultiterable.ResultIterable at
 0x4b1bd50.
 
 
 Could please tell me the way to show data for those object . I m using
 python
 
 
 
 Thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-operation-like-cogrop-groupbykey-on-pair-RDD-tp16487p16598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ALS implicit error pyspark

2014-10-16 Thread Gen
)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 14/10/16 19:22:44 WARN scheduler.TaskSetManager: Lost task 18.2 in stage
 975.0 (TID 1652, ip-172-31-35-241.ec2.internal): TaskKilled (killed
 intentionally)
14/10/16 19:22:44 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 975.0,
whose tasks have all completed, from pool




Gen wrote
 Hi,
 
 I am trying to use ALS.trainImplicit method in the
 pyspark.mllib.recommendation. However it didn't work. So I tried use the
 example in the python API documentation such as:
/
 r1 = (1, 1, 1.0) 
 r2 = (1, 2, 2.0) 
 r3 = (2, 1, 2.0) 
 ratings = sc.parallelize([r1, r2, r3]) 
 model = ALS.trainImplicit(ratings, 1) 
/
 
 It didn't work neither. After searching in google, I found that there are
 only two overloads for ALS.trainImplicit in the scala script. So I tried 
/
 model = ALS.trainImplicit(ratings, 1, 1)
/
 , it worked. But if I set the iterations other than 1,  
/
 model = ALS.trainImplicit(ratings, 1, 2)
/
  or 
/
 model = ALS.trainImplicit(ratings, 4, 2)
/
  for example, it generated error. The information is as follows:
 
 count at ALS.scala:314
 
 Job aborted due to stage failure: Task 6 in stage 189.0 failed 4 times,
 most recent failure: Lost task 6.3 in stage 189.0 (TID 626,
 ip-172-31-35-239.ec2.internal): com.esotericsoftware.kryo.KryoException:
 java.lang.ArrayStoreException: scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)

 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)

 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-15 Thread Gen
What results do you want?

If your pair is like (a, b), where a is the key and b is the value, you
can try 
rdd1 = rdd1.flatMap(lambda l: l)
and then use cogroup.
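
A tiny self-contained example with made-up data:

{code}
rdd1 = sc.parallelize([[("a", 1), ("b", 2)], [("a", 3)]])   # pairs nested in lists
rdd2 = sc.parallelize([("a", 10), ("c", 30)])

flat1 = rdd1.flatMap(lambda l: l)        # [("a", 1), ("b", 2), ("a", 3)]
grouped = flat1.cogroup(rdd2)
print([(k, (list(v1), list(v2))) for k, (v1, v2) in grouped.collect()])
# e.g. [('a', ([1, 3], [10])), ('b', ([2], [])), ('c', ([], [30]))]
{code}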

Best
Gen



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-operation-like-cogrop-groupbykey-on-pair-RDD-tp16487p16489.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi,

Are you sure that the id/key you used can access S3? You can try to use the
same id/key through the Python boto package to test it.

I ask because I have almost the same setup as yours, and I can access S3.
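
For example, a minimal boto check (the credentials and bucket name are
placeholders):

{code}
import boto

conn = boto.connect_s3("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")
bucket = conn.get_bucket("your-bucket-name")
print([k.name for k in bucket.list()][:10])   # list a few keys, just to prove access works
{code}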

Best



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16366.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL: select syntax

2014-10-14 Thread Gen
Hi,

I met the same problem before, and according to Matei Zaharia:
"The issue is that you're using SQLContext instead of HiveContext.
SQLContext implements a smaller subset of the SQL language, so you're
getting a SQL parse error because it doesn't support the syntax you have.
Look at how you'd write this in HiveQL, and then try doing that with
HiveContext."

In fact, there is more to it than that. If I remember correctly, Spark SQL
keeps all (15 + 5 = 20) columns in the final table. Therefore, joining two
tables that share column names will cause a duplicate-column error.
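
A minimal PySpark sketch of the HiveContext approach (just an illustration:
the table and column names are placeholders, sc is an existing SparkContext,
and Spark must be built with Hive support):

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
# HiveQL accepts the "a.*" projection, so you don't have to retype every column
joined = hiveContext.sql(
    "SELECT a.*, b.extra_col FROM a JOIN b ON a.key = b.key")
joined.registerTempTable("joined")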

Cheers
Gen


Hao Ren wrote
 Update:
 
 This syntax is mainly for avoiding retyping column names.
 
 Let's take the example in my previous post, where *a* is a table of 15
 columns and *b* has 5 columns. After a join, I have a table of
 (15 + 5 - 1 (key in b)) = 19 columns and register the table in sqlContext.
 
 I don't want to actually retype all the 19 columns' name when querying
 with select. This feature exists in hive.
 But in SparkSql, it gives an exception.
 
 Any ideas ? Thx
 
 Hao





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi,

If I remember correctly, Spark cannot use the IAM role credentials to
access S3. It first uses the id/key from the environment; if they are not
set there, it uses the values in the file core-site.xml. So an IAM role is
not useful for Spark. The same problem happens if you want to use the
distcp command in hadoop.

Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
temporary credentials? If yes, those credentials cannot be used directly by
Spark; for more information, you can take a look at
http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html
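
If you do end up generating a static id/key pair, here is a minimal PySpark
sketch of passing it through the Hadoop configuration (the property names
are the usual s3n ones; the key, secret and bucket are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="s3-access-test")
# Reach the underlying Hadoop configuration through the Java context
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

lines = sc.textFile("s3n://your-bucket/path/")
print(lines.count())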



sranga wrote
 Thanks for the pointers.
 I verified that the access key-id/secret used are valid. However, the
 secret may contain / at times. The issues I am facing are as follows:
 
- The EC2 instances are setup with an IAMRole () and don't have a static
key-id/secret
- All of the EC2 instances have access to S3 based on this role (I used
s3ls and s3cp commands to verify this)
- I can get a temporary access key-id/secret based on the IAMRole but
they generally expire in an hour
- If Spark is not able to use the IAMRole credentials, I may have to
generate a static key-id/secret. This may or may not be possible in the
environment I am in (from a policy perspective)
 
 
 
 - Ranga
 
 On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny <mag@> wrote:
 
 Hi,
 keep in mind that you're going to have a bad time if your secret key
 contains a /
 This is due to old and stupid hadoop bug:
 https://issues.apache.org/jira/browse/HADOOP-3733

 Best way is to regenerate the key so it does not include a /

 /Raf


 Akhil Das wrote:

 Try the following:

 1. Set the access key and secret key in the sparkContext:

 sparkContext.set("AWS_ACCESS_KEY_ID", "yourAccessKey")
 sparkContext.set("AWS_SECRET_ACCESS_KEY", "yourSecretKey")

 2. Set the access key and secret key in the environment before starting
 your application:

 export AWS_ACCESS_KEY_ID="your access"
 export AWS_SECRET_ACCESS_KEY="your secret"

 3. Set the access key and secret key inside the hadoop configurations:

 val hadoopConf = sparkContext.hadoopConfiguration
 hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
 hadoopConf.set("fs.s3.awsAccessKeyId", "yourAccessKey")
 hadoopConf.set("fs.s3.awsSecretAccessKey", "yourSecretKey")

 4. You can also try:

 val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")


 Thanks
 Best Regards

 On Mon, Oct 13, 2014 at 11:33 PM, Ranga <sranga@> wrote:

 Hi

 I am trying to access files/buckets in S3 and encountering a permissions
 issue. The buckets are configured to authenticate using an IAMRole
 provider.
 I have set the KeyId and Secret using environment variables (
 AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable
 to access the S3 buckets.

 Before setting the access key and secret the error was:
 java.lang.IllegalArgumentException:
 AWS Access Key ID and Secret Access Key must be specified as the
 username
 or password (respectively) of a s3n URL, or by setting the
 fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
 (respectively).

 After setting the access key and secret, the error is: The AWS Access
 Key Id you provided does not exist in our records.

 The id/secret being set are the right values. This makes me believe that
 something else (token, etc.) needs to be set as well.
 Any help is appreciated.


 - Ranga









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL -- more than two tables for join

2014-10-07 Thread TANG Gen
Hi, the same problem happens when I try several joins together, such as
'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)'

The error information is as follows:
py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
: java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
`INNER' found

SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
STO_KEY, sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
(sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)

        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)


I have the impression that Spark SQL doesn't support more than two tables
in a single join statement.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15847.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Gen
Hi, in fact, the same problem happens when I try several joins together:

SELECT * 
FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY 
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)

py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
: java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
`INNER' found

SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
(sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

I use Spark 1.1.0, so I have the impression that Spark SQL doesn't support
chaining several joins in a single statement.
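
A possible workaround (just a sketch, not tested against the 1.1.0 parser):
do the joins two tables at a time and register the intermediate result as a
temporary table, so that each sql() call contains only a single join. This
assumes the PySpark sqlContext from the snippet above.

# First join: sales with magasin; keep the columns needed later
step1 = sqlContext.sql(
    "SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, "
    "sales.STO_KEY AS STO_KEY, sales.Quantite AS Quantite, "
    "sales.Prix AS Prix, sales.Total AS Total, "
    "sales.BARC_KEY AS BARC_KEY, magasin.FORM_KEY AS FORM_KEY "
    "FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY")
step1.registerTempTable("sales_magasin")

# Second join: the intermediate table with eans
result = sqlContext.sql(
    "SELECT sales_magasin.Date AS Date, sales_magasin.ID_FOYER AS ID_FOYER, "
    "sales_magasin.STO_KEY AS STO_KEY, sales_magasin.Quantite AS Quantite, "
    "sales_magasin.Prix AS Prix, sales_magasin.Total AS Total, "
    "sales_magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY "
    "FROM sales_magasin INNER JOIN eans "
    "ON (sales_magasin.BARC_KEY = eans.BARC_KEY "
    "AND sales_magasin.FORM_KEY = eans.FORM_KEY)")

Alternatively, running the original statement through HiveContext instead
of SQLContext may also work, since the HiveQL parser handles chained joins.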




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The question about mount ephemeral disk in slave-setup.sh

2014-10-03 Thread TANG Gen
Hi,

I am quite a new user of Spark, and I have a stupid question about mounting
the ephemeral disks for AWS EC2.

If I understand the spark_ec2.py script correctly, it is
spark-ec2/setup-slave.sh that mounts the ephemeral disks (instance store
volumes) for AWS EC2. However, in setup-slave.sh, it seems that these disks
are only mounted if the instance type begins with r3.
For other instance types, are their ephemeral disks mounted or not? If yes,
which script mounts them, or are they mounted automatically by AWS?

Thanks a lot for your help in advance. 

Best regards
Gen




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-question-about-mount-ephemeral-disk-in-slave-setup-sh-tp15675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: The question about mount ephemeral disk in slave-setup.sh

2014-10-03 Thread TANG Gen
I have taken a look at the code of mesos/spark-ec2 and the AWS
documentation, and I think I found the answer.

In fact, there are two types of AMI in AWS: EBS-backed AMIs and
instance-store-backed AMIs. For an EBS-backed AMI, we can add instance
store volumes when we create the image (the details can be found in
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html ).
Then, by default, when we launch an instance from such an AMI, the instance
store volumes are formatted (ext3) and mounted at /media/ephemeral0, etc.

The images provided by mesos/spark-ec2 are EBS-backed AMIs with instance
store volumes already added (I guess). However, /etc/fstab is modified so
that the ephemeral disks are mounted at /mnt, etc. (but I don't know how
/etc/fstab is modified dynamically).

Finally, as described in setup-slave.sh, ext4 has the best performance for
r3* instances, so for those the ephemeral disks are reformatted to ext4 and
mounted at /mnt, etc.

Hope this helps someone else.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-question-about-mount-ephemeral-disk-in-slave-setup-sh-tp15675p15704.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Monitoring with Ganglia

2014-10-03 Thread TANG Gen
Maybe you can follow the instructions at this link:
https://github.com/mesos/spark-ec2/tree/v3/ganglia . It works well for me.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Monitoring-with-Ganglia-tp15538p15705.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark on python 3

2014-10-03 Thread Gen
According to the official Spark site, the latest version of Spark (1.1.0)
does not work with Python 3:

"Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the
standard CPython interpreter, so C libraries like NumPy can be used."



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-tp15706p15707.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partitions number with variable number of cores

2014-10-03 Thread Gen
Maybe I am wrong, but how many resources a Spark application can use
depends on the deployment mode (the type of resource manager); you can take
a look at https://spark.apache.org/docs/latest/job-scheduling.html .

For your case, I think Mesos is better, as it allows dynamic sharing of CPU
cores.

Best



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/partitions-number-with-variable-number-of-cores-tp15367p15710.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org