broken UI in 2.3?

2018-03-05 Thread Nan Zhu
Hi, all

I am experiencing some issues in the UI when using 2.3.

When I clicked the Executors/Storage tab, I got the following exception:

java.lang.NullPointerException
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:534)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)



Other versions of Spark are running fine in my environment. Any idea what
happened?

Best,

Nan


Re: Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
nvm

On Tue, Jan 9, 2018 at 9:42 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> Hi, all
>
> Out of curiosity, I just found a bunch of Palantir releases under
> org.apache.spark in Maven Central (https://mvnrepository.com/
> artifact/org.apache.spark/spark-core_2.11)?
>
> Is it on purpose?
>
> Best,
>
> Nan
>
>
>


Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
Hi, all

Out of curiosity, I just found a bunch of Palantir releases under
org.apache.spark in Maven Central (
https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11)?

Is it on purpose?

Best,

Nan


Re: --jars does not take remote jar?

2017-05-02 Thread Nan Zhu
I see. Thanks!

On Tue, May 2, 2017 at 9:12 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Tue, May 2, 2017 at 9:07 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> > I have no easy way to pass jar path to those forked Spark
> > applications? (except that I download jar from a remote path to a local
> temp
> > dir after resolving some permission issues, etc.?)
>
> Yes, that's the only way currently in client mode.
>
> --
> Marcelo
>


Re: --jars does not take remote jar?

2017-05-02 Thread Nan Zhu
Thanks for the reply! If I have an application master which starts some
Spark applications by forking processes (in yarn-client mode),

then essentially I have no easy way to pass a jar path to those forked Spark
applications? (except that I download the jar from a remote path to a local
temp dir after resolving some permission issues, etc.?)
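
(A minimal sketch of that work-around, with made-up paths and names: it just
copies the jar out of HDFS into a local temp dir so the local path can be
handed to --jars of the forked process. Assumes the Hadoop FileSystem API is
on the classpath.)

  import java.io.File
  import java.nio.file.Files
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val remoteJar = new Path("hdfs:///libs/my-deps.jar")                // hypothetical remote jar
  val localDir  = Files.createTempDirectory("forked-app-jars").toFile
  val localJar  = new Path(new File(localDir, "my-deps.jar").getAbsolutePath)

  // copy the remote jar to the local temp dir before forking the application
  FileSystem.get(remoteJar.toUri, new Configuration()).copyToLocalFile(remoteJar, localJar)
  // localJar.toString can now be passed via --jars to the forked spark-submit process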

On Tue, May 2, 2017 at 9:00 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Remote jars are added to executors' classpaths, but not the driver's.
> In YARN cluster mode, they would also be added to the driver's class
> path.
>
> On Tue, May 2, 2017 at 8:43 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> > Hi, all
> >
> > For some reason, I tried to pass in a HDFS path to the --jars option in
> > spark-submit
> >
> > According to the document,
> > http://spark.apache.org/docs/latest/submitting-
> applications.html#advanced-dependency-management,
> > --jars would accept remote path
> >
> > However, in the implementation,
> > https://github.com/apache/spark/blob/c622a87c44e0621e1b3024fdca9b2a
> a3c508615b/core/src/main/scala/org/apache/spark/deploy/
> SparkSubmit.scala#L757,
> > it does not look like so
> >
> > Did I miss anything?
> >
> > Best,
> >
> > Nan
>
>
>
> --
> Marcelo
>


--jars does not take remote jar?

2017-05-02 Thread Nan Zhu
Hi, all

For some reason, I tried to pass in a HDFS path to the --jars option in
spark-submit

According to the documentation,
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management,
--jars should accept a remote path.

However, in the implementation,
https://github.com/apache/spark/blob/c622a87c44e0621e1b3024fdca9b2aa3c508615b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L757,
it does not look like that is the case.

Did I miss anything?

Best,

Nan


Re: Azure Event Hub with Pyspark

2017-04-20 Thread Nan Zhu
DocDB does have a Java client? Is anything preventing you from using that?

Get Outlook for iOS

From: ayan guha 
Sent: Thursday, April 20, 2017 9:24:03 PM
To: Ashish Singh
Cc: user
Subject: Re: Azure Event Hub with Pyspark

Hi

Yes, it's Scala only. I am looking for a PySpark version, as I want to write to 
DocumentDB, which has good Python integration.

Thanks in advance

best
Ayan

On Fri, Apr 21, 2017 at 2:02 PM, Ashish Singh 
> wrote:
Hi ,

You can try https://github.com/hdinsight/spark-eventhubs : which is eventhub 
receiver for spark streaming
We are using it, but only the Scala version is available, I guess.


Thanks,
Ashish Singh

On Fri, Apr 21, 2017 at 9:19 AM, ayan guha 
> wrote:

Hi

I am not able to find any connector to connect Spark Streaming with 
Azure Event Hub using PySpark.

Does anyone know if such a library/package exists?

--
Best Regards,
Ayan Guha





--
Best Regards,
Ayan Guha


[Package Release] Widely accepted XGBoost now available in Spark

2016-03-16 Thread Nan Zhu
Dear Spark Users and Developers, 

(We apologize if you receive multiple copies of this email; we are resending it 
because we found that our earlier email was not delivered to the user mail list 
correctly.)

We are happy to announce the release of XGBoost4J 
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
 a Portable Distributed XGBoost in Spark, Flink and Dataflow.

XGBoost is an optimized distributed gradient boosting library designed to be 
highly efficient, flexible and portable. XGBoost provides parallel tree boosting 
(also known as GBDT, GBM) that solves many data science problems in a fast and 
accurate way. It has been the winning solution for many machine learning 
scenarios, ranging from Machine Learning Challenges to Industrial Use Cases.

XGBoost4J is a new package in XGBoost aiming to provide clean Scala/Java 
APIs and seamless integration with mainstream data processing platforms, 
like Apache Spark. With XGBoost4J, users can run XGBoost as a stage of a Spark 
job and build a unified pipeline from ETL to model training to data product 
service within Spark, instead of jumping across two different systems, i.e. 
XGBoost and Spark.

Today, we release the first version of XGBoost4J to bring more choices to the 
Spark users who are seeking solutions to build a highly efficient data 
analytics platform and enrich the Spark ecosystem. We will keep moving forward 
to integrate with more features of Spark. Of course, you are more than welcome 
to join us and contribute to the project!

For more details of distributed XGBoost, you can refer to the recently 
published paper: http://arxiv.org/abs/1603.02754

Best, 

-- 
Nan Zhu
http://codingcat.me





Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, 

We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy 
to announce the release of XGBoost4J 
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
 a Portable Distributed XGBoost in Spark, Flink and Dataflow 

XGBoost is an optimized distributed gradient boosting library designed to be 
highly efficient, flexible and portable. XGBoost provides parallel tree 
boosting (also known as GBDT, GBM) that solves many data science problems in a 
fast and accurate way. It has been the winning solution for many machine 
learning scenarios, ranging from Machine Learning Challenges 
(https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)
 to Industrial Use Cases 
(https://github.com/dmlc/xgboost/tree/master/demo#usecases).

XGBoost4J is a new package in XGBoost aiming to provide clean Scala/Java 
APIs and seamless integration with mainstream data processing platforms, 
like Apache Spark. With XGBoost4J, users can run XGBoost as a stage of a Spark 
job and build a unified pipeline from ETL to model training to data product 
service within Spark, instead of jumping across two different systems, i.e. 
XGBoost and Spark. (Example: 
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/DistTrainWithSpark.scala)

Today, we release the first version of XGBoost4J to bring more choices to the 
Spark users who are seeking solutions to build a highly efficient data 
analytics platform and enrich the Spark ecosystem. We will keep moving forward 
to integrate with more features of Spark. Of course, you are more than welcome 
to join us and contribute to the project!

For more details of distributed XGBoost, you can refer to the recently 
published paper: http://arxiv.org/abs/1603.02754

Best, 

-- 
Nan Zhu
http://codingcat.me



Re: Failing MiMa tests

2016-03-14 Thread Nan Zhu
I guess it’s Jenkins’ problem? My PR failed MiMa but still got a 
message from SparkQA (https://github.com/SparkQA) saying that "This patch 
passes all tests."

I checked Jenkins’ history; there are other PRs with the same issue….  

Best,

--  
Nan Zhu
http://codingcat.me

On Monday, March 14, 2016 at 10:26 PM, Ted Yu wrote:  
> Please refer to JIRAs which were related to MiMa
> e.g.
> [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.
>  
> It would be easier for other people to help if you provide link to your PR.
>  
> Cheers
>  
> On Mon, Mar 14, 2016 at 7:22 PM, Gayathri Murali <gayathri.m.sof...@gmail.com 
> (mailto:gayathri.m.sof...@gmail.com)> wrote:
> > Hi All,
> >  
> > I recently submitted a patch(which was passing all tests) with some minor 
> > modification to an existing PR. This patch is failing MiMa tests. Locally 
> > it passes all unit and style check tests. How do I fix MiMa test failures?
> >  
> > Thanks
> > Gayathri
> >  
> >  
> >  
>  



Re: Paper on Spark SQL

2015-08-17 Thread Nan Zhu
There is an extra “,” at the end of the link.

--  
Nan Zhu
http://codingcat.me


On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote:

 I got 404 when trying to access the link.  
  
  
  
 On Aug 17, 2015, at 5:31 AM, Todd bit1...@163.com (mailto:bit1...@163.com) 
 wrote:
  
  Hi,
  I can't access 
  http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. 
  (http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf,)
  Could someone help try to see if it is available and reply with it? Thanks!



Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Thank you, Jie! Very nice work!

--  
Nan Zhu
http://codingcat.me


On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote:

 Correct. Your calculation is right!

 We have been aware of that k-means performance drop also. According to our
 observation, it is caused by some unbalanced executions among different
 tasks, even though we used the same test data between different versions
 (i.e., it is not caused by data skew).

 The corresponding run time information has been shared with Xiangrui. Now
 he is also helping to identify the root cause altogether.

 Thank you & Best Regards,
 Grace (Huang Jie)

 From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
 Sent: Friday, June 26, 2015 7:59 PM
 To: Huang, Jie
 Cc: user@spark.apache.org (mailto:user@spark.apache.org);
 d...@spark.apache.org (mailto:d...@spark.apache.org)
 Subject: Re: [SparkScore]Performance portal for Apache Spark - WW26

 Hi, Jie,

 Thank you very much for this work! Very helpful!

 I just would like to confirm that I understand the numbers correctly: if we
 take the running time of the 1.2 release as 100s,

 9.1% - means the running time is 109.1 s?

 -4% - means it comes to 96s?

 If that’s the true meaning of the numbers, what happened to k-means in
 HiBench?

 Best,

 --
 Nan Zhu
 http://codingcat.me

 On Friday, June 26, 2015 at 7:24 AM, Huang, Jie wrote:
  Intel® Xeon® CPU E5-2697




Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Hi, Jie,  

Thank you very much for this work! Very helpful!

I just would like to confirm that I understand the numbers correctly: if we 
take the running time of the 1.2 release as 100s,

9.1% - means the running time is 109.1 s?

-4% - means it comes to 96s?

If that’s the true meaning of the numbers, what happened to k-means in HiBench?

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, June 26, 2015 at 7:24 AM, Huang, Jie wrote:

 Intel® Xeon® CPU E5-2697  




Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Nan Zhu
The Row class was mistakenly left undocumented in 1.3.0.

You can check the 1.3.1 API doc:
http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/api/scala/index.html#org.apache.spark.sql.Row
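
For reference, a minimal sketch of programmatic schema specification in 1.3
using Row.apply in Scala (RowFactory.create(...) is the Java-side counterpart).
The names and data are made up, and sc/sqlContext are assumed to be in scope as
in the shell:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  // programmatic schema, as in the SQL programming guide
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)))

  // Row(...) is Row.apply, used instead of the old Row.create(Object[])
  val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
  val df = sqlContext.createDataFrame(rows, schema)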

Best, 

-- 
Nan Zhu
http://codingcat.me


On Monday, April 6, 2015 at 10:23 AM, ARose wrote:

 I am trying to call Row.create(object[]) similarly to what's shown in this
 programming guide
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  
 , but the create() method is no longer recognized. I tried to look up the
 documentation for the Row api, but it does not seem to exist:
 http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/api/scala/index.html#org.apache.spark.sql.api.java.Row
 
 Is there a new equivalent for doing this programmatic specification of
 schema in 1.3.0?
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/What-happened-to-the-Row-class-in-1-3-0-tp22389.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Nan Zhu
Hi, Ted  

It’s here: 
https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java

Best,  

--  
Nan Zhu
http://codingcat.me


On Monday, April 6, 2015 at 10:44 AM, Ted Yu wrote:

 I searched code base but didn't find RowFactory class.
  
 Pardon me.
  
 On Mon, Apr 6, 2015 at 7:39 AM, Ted Yu yuzhih...@gmail.com 
 (mailto:yuzhih...@gmail.com) wrote:
  From scaladoc of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala 
  :
   
   * To create a new Row, use [[RowFactory.create()]] in Java or 
  [[Row.apply()]] in Scala.
   *
   
   
  Cheers
   
  On Mon, Apr 6, 2015 at 7:23 AM, ARose ashley.r...@telarix.com 
  (mailto:ashley.r...@telarix.com) wrote:
   I am trying to call Row.create(object[]) similarly to what's shown in  
   this
   programming guide
   https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
   , but the create() method is no longer recognized. I tried to look up the
   documentation for the Row api, but it does not seem to exist:
   http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/api/scala/index.html#org.apache.spark.sql.api.java.Row

   Is there a new equivalent for doing this programmatic specification of
   schema in 1.3.0?




   --
   View this message in context: 
   http://apache-spark-user-list.1001560.n3.nabble.com/What-happened-to-the-Row-class-in-1-3-0-tp22389.html
   Sent from the Apache Spark User List mailing list archive at Nabble.com 
   (http://Nabble.com).

   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)

   
  



Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Nan Zhu
The example in 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
 might help
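
For completeness, a minimal sketch of the usual way around the
NotSerializableException (table and column names below are made up): extract
plain, serializable values from the Result inside the map, instead of shipping
Result itself across the network:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat
  import org.apache.hadoop.hbase.util.Bytes

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "company")   // hypothetical table name

  val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])

  // convert each Result to serializable primitives before any shuffle or collect
  val companyData = hBaseRDD.map { case (rowKey, result) =>
    val cknid = Bytes.toString(rowKey.get())
    val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))) // hypothetical column
    (cknid.toInt, value)
  }
  companyData.count()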

Best, 

-- 
Nan Zhu
http://codingcat.me


On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote:

 Yep, it's not serializable:
 https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html
 
 You can't return this from a distributed operation since that would
 mean it has to travel over the network and you haven't supplied any
 way to convert the thing into bytes.
 
 On Tue, Mar 31, 2015 at 8:51 PM, Jeetendra Gangele gangele...@gmail.com 
 (mailto:gangele...@gmail.com) wrote:
  When I am trying to get the result from Hbase and running mapToPair function
  of RRD its giving the error
  java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result
  
  Here is the code
  
  // private static JavaPairRDD<Integer, Result> getCompanyDataRDD(JavaSparkContext sc) throws IOException {
  //   return sc.newAPIHadoopRDD(companyDAO.getCompnayDataConfiguration(),
  //       TableInputFormat.class, ImmutableBytesWritable.class,
  //       Result.class).mapToPair(new PairFunction<Tuple2<ImmutableBytesWritable, Result>, Integer, Result>() {
  //
  //     public Tuple2<Integer, Result> call(Tuple2<ImmutableBytesWritable, Result> t) throws Exception {
  //       System.out.println("In getCompanyDataRDD" + t._2);
  //
  //       String cknid = Bytes.toString(t._1.get());
  //       System.out.println("processing cknids is:" + cknid);
  //       Integer cknidInt = Integer.parseInt(cknid);
  //       Tuple2<Integer, Result> returnTuple = new Tuple2<Integer, Result>(cknidInt, t._2);
  //       return returnTuple;
  //     }
  //   });
  // }
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: How to use more executors

2015-03-11 Thread Nan Zhu
At least 1.4, I think.

For now, using YARN or allowing multiple worker instances is just fine.

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, March 11, 2015 at 8:42 PM, Du Li wrote:

 Is it being merged in the next release? It's indeed a critical patch!
  
 Du  
  
  
 On Wednesday, January 21, 2015 3:59 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  
  
 …not sure when will it be reviewed…
  
 but for now you can work around by allowing multiple worker instances on a 
 single machine  
  
 http://spark.apache.org/docs/latest/spark-standalone.html
  
 search SPARK_WORKER_INSTANCES
  
 Best,  
  
 --  
 Nan Zhu
 http://codingcat.me
  
 On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:
  Will  SPARK-1706 be included in next release?
   
  On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu yuzhih...@gmail.com 
  (mailto:yuzhih...@gmail.com) wrote:
   Please see SPARK-1706

   On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu larryli...@gmail.com 
   (mailto:larryli...@gmail.com) wrote:
I tried to submit a job with  --conf spark.cores.max=6  or 
--total-executor-cores 6 on a standalone cluster. But I don't see more 
than 1 executor on each worker. I am wondering how to use multiple 
executors when submitting jobs.
 
Thanks
larry
 
 



   
  
  
  



Re: How to use more executors

2015-03-11 Thread Nan Zhu
I think this should go into another PR.

Can you create a JIRA for that?

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, March 11, 2015 at 8:50 PM, Du Li wrote:

 Is it possible to extend this PR further (or create another PR) to allow for 
 per-node configuration of workers?  
  
 There are many discussions about heterogeneous spark cluster. Currently 
 configuration on master will override those on the workers. Many spark users 
 have the need for having machines with different cpu/memory capacities in the 
 same cluster.
  
 Du  
  
  
 On Wednesday, January 21, 2015 3:59 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  
  
 …not sure when will it be reviewed…
  
 but for now you can work around by allowing multiple worker instances on a 
 single machine  
  
 http://spark.apache.org/docs/latest/spark-standalone.html
  
 search SPARK_WORKER_INSTANCES
  
 Best,  
  
 --  
 Nan Zhu
 http://codingcat.me
  
 On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:
  Will  SPARK-1706 be included in next release?
   
  On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu yuzhih...@gmail.com 
  (mailto:yuzhih...@gmail.com) wrote:
   Please see SPARK-1706

   On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu larryli...@gmail.com 
   (mailto:larryli...@gmail.com) wrote:
I tried to submit a job with  --conf spark.cores.max=6  or 
--total-executor-cores 6 on a standalone cluster. But I don't see more 
than 1 executor on each worker. I am wondering how to use multiple 
executors when submitting jobs.
 
Thanks
larry
 
 



   
  
  
  



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Nan Zhu
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable 
output validation for the whole program,

the Spark implementation uses a dynamic variable (in object PairRDDFunctions) 
internally to disable it on a case-by-case basis:

  val disableOutputSpecValidation: DynamicVariable[Boolean] = new DynamicVariable[Boolean](false)

I’m not sure if there is enough benefit to make it worth exposing 
this variable to the user…
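
For what it's worth, the whole-program switch mentioned above looks roughly like
this minimal sketch (output path is made up; the caveat in the quoted discussion
below, about ending up with a mix of old and new partitions, still applies):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("overwrite-example")
    .set("spark.hadoop.validateOutputSpecs", "false")  // skip the "output dir already exists" check

  val sc = new SparkContext(conf)
  sc.parallelize(Seq(1, 2, 3)).saveAsTextFile("/tmp/out")  // no longer fails if /tmp/out already exists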

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

 Found this thread:
 http://search-hadoop.com/m/JW1q5HMrge2
  
 Cheers
  
 On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  This was discussed in the past and viewed as dangerous to enable. The
  biggest problem, by far, comes when you have a job that output M
  partitions, 'overwriting' a directory of data containing N  M old
  partitions. You suddenly have a mix of new and old data.
   
  It doesn't match Hadoop's semantics either, which won't let you do
  this. You can of course simply remove the output directory.
   
  On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com 
  (mailto:yuzhih...@gmail.com) wrote:
   Adding support for overwrite flag would make saveAsXXFile more user 
   friendly.
  
   Cheers
  
  
  
   On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com 
   (mailto:zjf...@gmail.com) wrote:
  
   Hi folks,
  
   I found that RDD:saveXXFile has no overwrite flag which I think is very 
   helpful. Is there any reason for this ?
  
  
  
   --
   Best Regards
  
   Jeff Zhang
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)
  
  



Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Nan Zhu
There are some “hidden” APIs potentially addressing your problem (but with a 
bit of complexity).

Using the Actor Receiver, you can tell the supervisor of the actor receiver to 
create another actor receiver for you; the ActorRef of the newly created actor 
will be sent to the caller of the API (in most cases, that’s one of the 
existing actor receivers).

The limitation might be that

all receivers are on the same machine...


Here is a PR trying to expose the APIs to the user: 
https://github.com/apache/spark/pull/3984

Best,  

--  
Nan Zhu
http://codingcat.me


On Monday, March 2, 2015 at 10:19 AM, Tamas Jambor wrote:

 Sorry, I meant once the stream is started, it's not possible to create new 
 streams in the existing streaming context, and it's not possible to create 
 new streaming context if another one is already running.
 So the only feasible option seemed to create new sparkcontexts for each 
 stream (tried using spark-jobserver for that).
  
  
 On Mon, Mar 2, 2015 at 3:07 PM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  You can make a new StreamingContext on an existing SparkContext, I believe?
   
  On Mon, Mar 2, 2015 at 3:01 PM, Tamas Jambor jambo...@gmail.com 
  (mailto:jambo...@gmail.com) wrote:
   thanks for the reply.
  
   Actually, our main problem is not really about sparkcontext, the problem 
   is
   that spark does not allow to create streaming context dynamically, and 
   once
   a stream is shut down, a new one cannot be created in the same 
   sparkcontext.
   So we cannot create a service that would create and manage multiple 
   streams
   - the same way that is possible with batch jobs.
  
   On Mon, Mar 2, 2015 at 2:52 PM, Sean Owen so...@cloudera.com 
   (mailto:so...@cloudera.com) wrote:
  
   I think everything there is to know about it is on JIRA; I don't think
   that's being worked on.
  
   On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com 
   (mailto:jambo...@gmail.com) wrote:
I have seen there is a card (SPARK-2243) to enable that. Is that still
going
ahead?
   
On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com 
(mailto:so...@cloudera.com) wrote:
   
It is still not something you're supposed to do; in fact there is a
setting (disabled by default) that throws an exception if you try to
make multiple contexts.
   
On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com 
(mailto:jambo...@gmail.com) wrote:
 hi all,

 what is the current status and direction on enabling multiple
 sparkcontexts
 and streamingcontext? I have seen a few issues open on JIRA, which
 seem
 to
 be there for quite a while.

 thanks,



 --
 View this message in context:

 http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com (http://Nabble.com).

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)

   
   
  
  
  



Re: How to use more executors

2015-01-21 Thread Nan Zhu
…not sure when it will be reviewed…

but for now you can work around it by allowing multiple worker instances on a 
single machine:

http://spark.apache.org/docs/latest/spark-standalone.html

search SPARK_WORKER_INSTANCES

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, January 21, 2015 at 6:50 PM, Larry Liu wrote:

 Will  SPARK-1706 be included in next release?
  
 On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu yuzhih...@gmail.com 
 (mailto:yuzhih...@gmail.com) wrote:
  Please see SPARK-1706
   
  On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu larryli...@gmail.com 
  (mailto:larryli...@gmail.com) wrote:
   I tried to submit a job with  --conf spark.cores.max=6  or 
   --total-executor-cores 6 on a standalone cluster. But I don't see more 
   than 1 executor on each worker. I am wondering how to use multiple 
   executors when submitting jobs.

   Thanks
   larry


   
   
   
  



Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
I quickly went through the code,  

In ExecutorBackend, we build the actor system with

  // Create SparkEnv using properties we fetched from the driver.
  val driverConf = new SparkConf().setAll(props)
  val env = SparkEnv.createExecutorEnv(
    driverConf, executorId, hostname, port, cores, isLocal = false)

props is the Spark properties fetched from the driver.

On the driver side, we start the driver actor with

  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }
  // TODO (prashant) send conf instead of properties
  driverActor = actorSystem.actorOf(
    Props(new DriverActor(properties)), name = CoarseGrainedSchedulerBackend.ACTOR_NAME)

This “properties” is the stuff being sent to the executor. It seems that we only 
admit the properties starting with “spark.”

It seems that we cannot pass akka.* to the executor?

Hi, Josh, would you mind giving some hints, as you created and closed the JIRA?

Best,


--  
Nan Zhu



On Wednesday, January 14, 2015 at 6:19 PM, Nan Zhu wrote:

 Hi, Ted,  
  
 Thanks
  
 I know how to set in Akka’s context, my question is just how to pass this 
 aka.loglevel=DEBUG to Spark’s actor system
  
 Best,
 --  
 Nan Zhu
 http://codingcat.me
  
  
 On Wednesday, January 14, 2015 at 6:09 PM, Ted Yu wrote:
  
  I assume you have looked at:
   
  http://doc.akka.io/docs/akka/2.0/scala/logging.html
  http://doc.akka.io/docs/akka/current/additional/faq.html (Debugging, last 
  question)
   
  Cheers
   
  On Wed, Jan 14, 2015 at 2:55 PM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   Hi, all

   though https://issues.apache.org/jira/browse/SPARK-609 was closed, I’m 
   still unclear about how to enable debug level log output in Spark’s actor 
   system  

   Anyone can give the suggestion?

   BTW, I think we need to document it on somewhere, as the user who writes 
   actor-based receiver of spark streaming, (like me), usually needs the 
   detailed log for debugging….

   Best,  

   --  
   Nan Zhu
   http://codingcat.me


   
   
   
  



Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
Sorry for the mistake,

I found that those Akka-related messages are from Spark's own Akka-related component 
(ActorLogReceive), instead of Akka itself, though it has been enough for 
debugging purposes (in my case).

The question in this thread is still open….

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, January 14, 2015 at 6:56 PM, Nan Zhu wrote:

 for others who have the same question:
  
 you can simply set logging level in log4j.properties to DEBUG to achieve this
  
 Best,  
  
 --  
 Nan Zhu
 http://codingcat.me
  
  
 On Wednesday, January 14, 2015 at 6:28 PM, Nan Zhu wrote:
  
  I quickly went through the code,  
   
  In ExecutorBackend, we build the actor system with  
   
  // Create SparkEnv using properties we fetched from the driver.
  val driverConf = new SparkConf().setAll(props)
  val env = SparkEnv.createExecutorEnv(
  driverConf, executorId, hostname, port, cores, isLocal = false)
   
  props is the spark properties fetched from driver,  
   
  In Driver side, we start driver actor with  
   
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }
  // TODO (prashant) send conf instead of properties
  driverActor = actorSystem.actorOf(
  Props(new DriverActor(properties)), name = 
  CoarseGrainedSchedulerBackend.ACTOR_NAME)
  This “properties” is the stuff being sent to executor. It seems that we 
  only admit the properties starting with “spark.”
   
  It seems that we cannot pass akka.* to executor?
  Hi, Josh, would you mind giving some hints, as you created and closed the 
  JIRA?
  Best,
   
   
  --  
  Nan Zhu
   
   
  On Wednesday, January 14, 2015 at 6:19 PM, Nan Zhu wrote:
   
   Hi, Ted,  

   Thanks

   I know how to set in Akka’s context, my question is just how to pass this 
   aka.loglevel=DEBUG to Spark’s actor system

   Best,
   --  
   Nan Zhu
   http://codingcat.me


   On Wednesday, January 14, 2015 at 6:09 PM, Ted Yu wrote:

I assume you have looked at:
 
http://doc.akka.io/docs/akka/2.0/scala/logging.html
http://doc.akka.io/docs/akka/current/additional/faq.html (Debugging, 
last question)
 
Cheers
 
On Wed, Jan 14, 2015 at 2:55 PM, Nan Zhu zhunanmcg...@gmail.com 
(mailto:zhunanmcg...@gmail.com) wrote:
 Hi, all
  
 though https://issues.apache.org/jira/browse/SPARK-609 was closed, 
 I’m still unclear about how to enable debug level log output in 
 Spark’s actor system  
  
 Anyone can give the suggestion?
  
 BTW, I think we need to document it on somewhere, as the user who 
 writes actor-based receiver of spark streaming, (like me), usually 
 needs the detailed log for debugging….
  
 Best,  
  
 --  
 Nan Zhu
 http://codingcat.me
  
  
 
 
 

   
  



Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
for others who have the same question:

you can simply set logging level in log4j.properties to DEBUG to achieve this

Best,  

--  
Nan Zhu
http://codingcat.me


On Wednesday, January 14, 2015 at 6:28 PM, Nan Zhu wrote:

 I quickly went through the code,  
  
 In ExecutorBackend, we build the actor system with  
  
 // Create SparkEnv using properties we fetched from the driver.
 val driverConf = new SparkConf().setAll(props)
 val env = SparkEnv.createExecutorEnv(
 driverConf, executorId, hostname, port, cores, isLocal = false)
  
 props is the spark properties fetched from driver,  
  
 In Driver side, we start driver actor with  
  
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }
 // TODO (prashant) send conf instead of properties
 driverActor = actorSystem.actorOf(
 Props(new DriverActor(properties)), name = 
 CoarseGrainedSchedulerBackend.ACTOR_NAME)
 This “properties” is the stuff being sent to executor. It seems that we only 
 admit the properties starting with “spark.”
  
 It seems that we cannot pass akka.* to executor?
 Hi, Josh, would you mind giving some hints, as you created and closed the 
 JIRA?
 Best,
  
  
 --  
 Nan Zhu
  
  
 On Wednesday, January 14, 2015 at 6:19 PM, Nan Zhu wrote:
  
  Hi, Ted,  
   
  Thanks
   
  I know how to set in Akka’s context, my question is just how to pass this 
  aka.loglevel=DEBUG to Spark’s actor system
   
  Best,
  --  
  Nan Zhu
  http://codingcat.me
   
   
  On Wednesday, January 14, 2015 at 6:09 PM, Ted Yu wrote:
   
   I assume you have looked at:

   http://doc.akka.io/docs/akka/2.0/scala/logging.html
   http://doc.akka.io/docs/akka/current/additional/faq.html (Debugging, last 
   question)

   Cheers

   On Wed, Jan 14, 2015 at 2:55 PM, Nan Zhu zhunanmcg...@gmail.com 
   (mailto:zhunanmcg...@gmail.com) wrote:
Hi, all
 
though https://issues.apache.org/jira/browse/SPARK-609 was closed, I’m 
still unclear about how to enable debug level log output in Spark’s 
actor system  
 
Anyone can give the suggestion?
 
BTW, I think we need to document it on somewhere, as the user who 
writes actor-based receiver of spark streaming, (like me), usually 
needs the detailed log for debugging….
 
Best,  
 
--  
Nan Zhu
http://codingcat.me
 
 



   
  



Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
Hi, Ted,  

Thanks

I know how to set in Akka’s context, my question is just how to pass this 
aka.loglevel=DEBUG to Spark’s actor system

Best,
--  
Nan Zhu
http://codingcat.me


On Wednesday, January 14, 2015 at 6:09 PM, Ted Yu wrote:

 I assume you have looked at:
  
 http://doc.akka.io/docs/akka/2.0/scala/logging.html
 http://doc.akka.io/docs/akka/current/additional/faq.html (Debugging, last 
 question)
  
 Cheers
  
 On Wed, Jan 14, 2015 at 2:55 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, all
   
  though https://issues.apache.org/jira/browse/SPARK-609 was closed, I’m 
  still unclear about how to enable debug level log output in Spark’s actor 
  system  
   
  Anyone can give the suggestion?
   
  BTW, I think we need to document it on somewhere, as the user who writes 
  actor-based receiver of spark streaming, (like me), usually needs the 
  detailed log for debugging….
   
  Best,  
   
  --  
  Nan Zhu
  http://codingcat.me
   
   
  
  
  



enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
Hi, all

Though https://issues.apache.org/jira/browse/SPARK-609 was closed, I’m still 
unclear about how to enable debug-level log output in Spark’s actor system.

Can anyone give a suggestion?

BTW, I think we need to document this somewhere, as users who write 
actor-based receivers for Spark Streaming (like me) usually need the detailed 
log for debugging….

Best,  

--  
Nan Zhu
http://codingcat.me



Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Nan Zhu
Thanks, Xiangrui,

I found the reason for the overlapped training set and test set

….

It is another counter-intuitive issue, related to 
https://github.com/apache/spark/pull/2508
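
(For anyone hitting the same thing, the usual guard is to make the input
deterministic, e.g. by caching it, before calling kFold. A minimal sketch with
toy data, assuming the cause is a non-deterministically re-evaluated input RDD
as in the PR above:)

  import org.apache.spark.mllib.util.MLUtils

  val data = sc.parallelize(1 to 1000).cache()   // pin the input so both samplers see the same records
  val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)

  folds.foreach { case (training, validation) =>
    // with a deterministic input, each (training, validation) pair is disjoint
    println(s"train=${training.count()} validation=${validation.count()}")
  }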

Best,  

--  
Nan Zhu


On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:

 1. No.
  
 2. The seed per partition is fixed. So it should generate
 non-overlapping subsets.
  
 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
  
 Best,
 Xiangrui
  
 On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, all
   
  When we use MLUtils.kfold to generate training and validation set for cross
  validation
   
  we found that there is overlapped part in two sets….
   
  from the code, it does sampling for twice for the same dataset
   
   @Experimental
   def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
     val numFoldsF = numFolds.toFloat
     (1 to numFolds).map { fold =>
       val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
         complement = false)
       val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
       val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
       (training, validation)
     }.toArray
   }
    
   the sampler is complement, there is still possibility to generate overlapped
   training and validation set
    
   because the sampling method looks like:
    
   override def sample(items: Iterator[T]): Iterator[T] = {
     items.filter { item =>
       val x = rng.nextDouble()
       (x >= lb && x < ub) ^ complement
     }
   }
   
  I’m not a machine learning guy, so I guess I must fall into one of the
  following three situations
   
  1. does it mean actually we allow overlapped training and validation set ?
  (counter intuitive to me)
   
  2. I had some misunderstanding on the code?
   
  3. it’s a bug?
   
  Anyone can explain it to me?
   
  Best,
   
  --
  Nan Zhu
   
  
  
  




Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Nan Zhu
Great! Congratulations! 

-- 
Nan Zhu


On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:

 Brilliant stuff ! Congrats all :-)
 This is indeed really heartening news !
 
 Regards,
 Mridul
 
 
 On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com) wrote:
  Hi folks,
  
  I interrupt your regularly scheduled user / dev list to bring you some 
  pretty cool news for the project, which is that we've been able to use 
  Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
  faster on 10x fewer nodes. There's a detailed writeup at 
  http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
   Summary: while Hadoop MapReduce held last year's 100 TB world record by 
  sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
  206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
  
  I want to thank Reynold Xin for leading this effort over the past few 
  weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
  Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
  providing the machines to make this possible. Finally, this result would of 
  course not be possible without the many many other contributions, testing 
  and feature requests from throughout the community.
  
  For an engine to scale from these multi-hour petabyte batch jobs down to 
  100-millisecond streaming and interactive queries is quite uncommon, and 
  it's thanks to all of you folks that we are able to make this happen.
  
  Matei
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org)
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org)
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: Akka disassociation on Java SE Embedded

2014-10-10 Thread Nan Zhu
https://github.com/CodingCat/spark/commit/c5cee24689ac4ad1187244e6a16537452e99e771
 

-- 
Nan Zhu


On Friday, October 10, 2014 at 4:31 PM, bhusted wrote:

 How do you increase the spark block manager timeout?
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Akka-disassociation-on-Java-SE-Embedded-tp6266p16176.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




MLUtil.kfold generates overlapped training and validation set?

2014-10-09 Thread Nan Zhu
Hi, all  

When we use MLUtils.kFold to generate training and validation sets for cross 
validation,

we found that there is an overlapping part in the two sets….

From the code, it does the sampling twice over the same dataset:

  @Experimental
  def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
    val numFoldsF = numFolds.toFloat
    (1 to numFolds).map { fold =>
      val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
        complement = false)
      val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
      val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
      (training, validation)
    }.toArray
  }


Even though the training sampler is the complement of the validation sampler, 
there is still a possibility of generating overlapping training and validation sets,

because the sampling method looks like:

override def sample(items: Iterator[T]): Iterator[T] = {
    items.filter { item =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }


I’m not a machine learning guy, so I guess I must fall into one of the 
following three situations:

1. Does it mean we actually allow overlapping training and validation sets? 
(counter-intuitive to me)

2. I have some misunderstanding of the code?

3. It’s a bug?

Can anyone explain it to me?

Best,  

--  
Nan Zhu



Re: Reading from HBase is too slow

2014-09-29 Thread Nan Zhu
can you look at your HBase UI to check whether your job is just reading from a 
single region server? 

Best, 

-- 
Nan Zhu


On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote:

 I submitted a job in Yarn-Client mode, which simply reads from a HBase table 
 containing tens of millions of records and then does a count action. The job 
 runs for a much longer time than I expected, so I wonder whether it was 
 because the data to read was too much. Actually, there are 20 nodes in my 
 Hadoop cluster so the HBase table seems not so big (tens of millions of 
 records).
 
 I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
 
 BTW, when the job was running, I can see logs on the console, and 
 specifically I'd like to know what the following log means:
 
  14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as 
  TID 20 on executor 2: b04.jsepc.com (http://b04.jsepc.com) (PROCESS_LOCAL)
  14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as 
  13454 bytes in 0 ms
  14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426 
  ms on b04.jsepc.com (http://b04.jsepc.com) (progress: 18/86)
  14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
  
 
 Thanks 



Re: executorAdded event to DAGScheduler

2014-09-26 Thread Nan Zhu
Just a quick reply: we cannot start two executors on the same host for a single 
application in the standard deployment (one worker per machine).

I’m not sure whether it will create an issue when you have multiple workers on the 
same host, as submitWaitingStages is called everywhere and I have never tried such a 
deployment mode.

Best,  

--  
Nan Zhu


On Friday, September 26, 2014 at 8:02 AM, praveen seluka wrote:

 Can someone explain the motivation behind passing executorAdded event to 
 DAGScheduler ? DAGScheduler does submitWaitingStages when executorAdded 
 method is called by TaskSchedulerImpl. I see some issue in the below code,
  
 TaskSchedulerImpl.scala code
 if (!executorsByHost.contains(o.host)) {
 executorsByHost(o.host) = new HashSet[String]()
 executorAdded(o.executorId, o.host)
 newExecAvail = true
   }
  
  
 Note that executorAdded is called only when there is a new host and not for 
 every new executor. For instance, there can be two executors in the same host 
 and in this case. (But DAGScheduler executorAdded is notified only for new 
 host - so only once in this case). If this is indeed an issue, I would like 
 to submit a patch for this quickly. [cc Andrew Or]
  
 - Praveen
  
  



Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
great, thanks 

-- 
Nan Zhu


On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote:

 Yes, Matei made a JIRA last week and I just suggested a PR:
 https://github.com/apache/spark/pull/2508 
 On Sep 23, 2014 2:55 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  shall we document this in the API doc? 
  
  Best, 
  
  -- 
  Nan Zhu
  
  
  On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
  
   zipWithUniqueId is also affected...
   
   I had to persist the dictionaries to make use of the indices lower down 
   in the flow...
   
   On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen so...@cloudera.com 
   (mailto:so...@cloudera.com) wrote:
Reference - https://issues.apache.org/jira/browse/SPARK-3098
I imagine zipWithUniqueID is also affected, but may not happen to have
exhibited in your test.

On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com 
(mailto:debasish.da...@gmail.com) wrote:
 Some more debug revealed that as Sean said I have to keep the 
 dictionaries
 persisted till I am done with the RDD manipulation.

 Thanks Sean for the pointer...would it be possible to point me to the 
 JIRA
 as well ?

 Are there plans to make it more transparent for the users ?

 Is it possible for the DAG to speculate such things...similar to 
 branch
 prediction ideas from comp arch...



 On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das 
 debasish.da...@gmail.com (mailto:debasish.da...@gmail.com)
 wrote:

 I changed zipWithIndex to zipWithUniqueId and that seems to be 
 working...

 What's the difference between zipWithIndex vs zipWithUniqueId ?

 For zipWithIndex we don't need to run the count to compute the offset
 which is needed for zipWithUniqueId and so zipWithIndex is efficient 
 ? It's
 not very clear from docs...


 On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das 
 debasish.da...@gmail.com (mailto:debasish.da...@gmail.com)
 wrote:

 I did not persist / cache it as I assumed zipWithIndex will preserve
 order...

 There is also zipWithUniqueId...I am trying that...If that also 
 shows the
 same issue, we should make it clear in the docs...

 On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:

 From offline question - zipWithIndex is being used to assign IDs. 
 From a
 recent JIRA discussion I understand this is not deterministic 
 within a
 partition so the index can be different when the RDD is 
 reevaluated. If you
 need it fixed, persist the zipped RDD on disk or in memory.

 On Sep 20, 2014 8:10 PM, Debasish Das debasish.da...@gmail.com 
 (mailto:debasish.da...@gmail.com)
 wrote:

 Hi,

 I am building a dictionary of RDD[(String, Long)] and after the
 dictionary is built and cached, I find key almonds at value 
 5187 using:

  rdd.filter { case (product, index) => product == "almonds" }.collect

 Output:

 Debug product almonds index 5187

 Now I take the same dictionary and write it out as:

  dictionary.map { case (product, index) => product + "," + index }
  .saveAsTextFile(outputPath)

 Inside the map I also print what's the product at index 5187 and 
 I get
 a different product:

 Debug Index 5187 userOrProduct cardigans

 Is this an expected behavior from map ?

 By the way almonds and apparel-cardigans are just one off in 
 the
 index...

 I am using spark-1.1 but it's a snapshot..

 Thanks.
 Deb





   
  



Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
shall we document this in the API doc? 
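
(For reference, the work-around described in the quoted discussion below,
persisting the zipped RDD before reusing the indices, looks roughly like this
minimal sketch with made-up data:)

  val products = sc.parallelize(Seq("almonds", "cardigans", "apparel"))

  // persist the zipped RDD so the indices are not recomputed (and possibly
  // reassigned) when the dictionary is evaluated again
  val dictionary = products.zipWithIndex().persist()

  dictionary.filter { case (product, _) => product == "almonds" }.collect()
  dictionary.map { case (product, index) => product + "," + index }
    .saveAsTextFile("/tmp/dictionary")   // sees the same indices as the lookup above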

Best, 

-- 
Nan Zhu


On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:

 zipWithUniqueId is also affected...
 
 I had to persist the dictionaries to make use of the indices lower down in 
 the flow...
 
 On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  Reference - https://issues.apache.org/jira/browse/SPARK-3098
  I imagine zipWithUniqueID is also affected, but may not happen to have
  exhibited in your test.
  
  On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com 
  (mailto:debasish.da...@gmail.com) wrote:
   Some more debug revealed that as Sean said I have to keep the dictionaries
   persisted till I am done with the RDD manipulation.
  
   Thanks Sean for the pointer...would it be possible to point me to the JIRA
   as well ?
  
   Are there plans to make it more transparent for the users ?
  
   Is it possible for the DAG to speculate such things...similar to branch
   prediction ideas from comp arch...
  
  
  
   On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das debasish.da...@gmail.com 
   (mailto:debasish.da...@gmail.com)
   wrote:
  
   I changed zipWithIndex to zipWithUniqueId and that seems to be working...
  
   What's the difference between zipWithIndex vs zipWithUniqueId ?
  
   For zipWithIndex we don't need to run the count to compute the offset
   which is needed for zipWithUniqueId and so zipWithIndex is efficient ? 
   It's
   not very clear from docs...
  
  
   On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com 
   (mailto:debasish.da...@gmail.com)
   wrote:
  
   I did not persist / cache it as I assumed zipWithIndex will preserve
   order...
  
   There is also zipWithUniqueId...I am trying that...If that also shows 
   the
   same issue, we should make it clear in the docs...
  
   On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com 
   (mailto:so...@cloudera.com) wrote:
  
   From offline question - zipWithIndex is being used to assign IDs. From 
   a
   recent JIRA discussion I understand this is not deterministic within a
   partition so the index can be different when the RDD is reevaluated. 
   If you
   need it fixed, persist the zipped RDD on disk or in memory.
  
   On Sep 20, 2014 8:10 PM, Debasish Das debasish.da...@gmail.com 
   (mailto:debasish.da...@gmail.com)
   wrote:
  
   Hi,
  
   I am building a dictionary of RDD[(String, Long)] and after the
   dictionary is built and cached, I find key almonds at value 5187 
   using:
  
   rdd.filter { case (product, index) => product == "almonds" }.collect
  
   Output:
  
   Debug product almonds index 5187
  
   Now I take the same dictionary and write it out as:
  
   dictionary.map { case (product, index) => product + "," + index }
   .saveAsTextFile(outputPath)
  
   Inside the map I also print what's the product at index 5187 and I get
   a different product:
  
   Debug Index 5187 userOrProduct cardigans
  
   Is this an expected behavior from map ?
  
   By the way almonds and apparel-cardigans are just one off in the
   index...
  
   I am using spark-1.1 but it's a snapshot..
  
   Thanks.
   Deb
  
  
  
  
  
 



Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
Hi,

Can you attach more logs to see if there is some entry from ContextCleaner?

I met a very similar issue before… but I haven’t gotten it resolved.

Best,  

--  
Nan Zhu


On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:

 Dear All,  
  
 Not sure if this is a false alarm. But wanted to raise to this to understand 
 what is happening.  
  
 I am testing the Kafka Receiver which I have written 
 (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low 
 level Kafka Consumer implemented custom Receivers for every Kafka topic 
 partitions and pulling data in parallel. Individual streams from all topic 
 partitions are then merged to create Union stream which used for further 
 processing.
  
 The custom Receiver working fine in normal load with no issues. But when I 
 tested this with huge amount of backlog messages from Kafka ( 50 million + 
 messages), I see couple of major issue in Spark Streaming. Wanted to get some 
 opinion on this
  
 I am using latest Spark 1.1 taken from the source and built it. Running in 
 Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode.
  
 Below are two main question I have..
  
 1. What I am seeing when I run the Spark Streaming with my Kafka Consumer 
 with a huge backlog in Kafka ( around 50 Million), Spark is completely busy 
 performing the Receiving task and hardly schedule any processing task. Can 
 you let me if this is expected ? If there is large backlog, Spark will take 
 long time pulling them . Why Spark not doing any processing ? Is it because 
 of resource limitation ( say all cores are busy puling ) or it is by design ? 
 I am setting the executor-memory to 10G and driver-memory to 4G .
  
 2. This issue seems to be more serious. I have attached the Driver trace with 
 this email. What I can see very frequently Block are selected to be 
 Removed...This kind of entries are all over the place. But when a Block is 
 removed , below problem happen May be this issue cause the issue 1 that 
 no Jobs are getting processed ..
  
  
 INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping
 INFO : org.apache.spark.storage.BlockManager - Dropping block 
 input-0-1410443074600 from memory
 INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of 
 size 12651900 dropped from memory (free 21220667)
 INFO : org.apache.spark.storage.BlockManagerInfo - Removed 
 input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 
 (http://ip-10-252-5-113.asskickery.us:53752) in memory (size: 12.1 MB, free: 
 100.6 MB)
  
 ...
  
 INFO : org.apache.spark.storage.BlockManagerInfo - Removed 
 input-0-1410443074600 on ip-10-252-5-62.asskickery.us:37033 
 (http://ip-10-252-5-62.asskickery.us:37033) in memory (size: 12.1 MB, free: 
 154.6 MB)
 ..
  
  
 WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 7.0 
 (TID 118, ip-10-252-5-62.asskickery.us 
 (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not 
 compute split, block input-0-1410443074600 not found
  
 ...
  
 INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 0.1 in stage 7.0 
 (TID 126) on executor ip-10-252-5-62.asskickery.us 
 (http://ip-10-252-5-62.asskickery.us): java.lang.Exception (Could not compute 
 split, block input-0-1410443074600 not found) [duplicate 1]
  
  
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 
 (TID 139, ip-10-252-5-62.asskickery.us 
 (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not 
 compute split, block input-0-1410443074600 not found
 org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:744)
  
  
 Regards,  
 Dibyendu
  
  
  
  
  
  
 Attachments:  
 - driver-trace.txt
  




Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
) at 
sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
scala.collection.immutable.$colon$colon.readObject(List.scala:362) at 
sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) 
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:744)



--  
Nan Zhu


On Thursday, September 11, 2014 at 10:42 AM, Nan Zhu wrote:

 Hi,   
  
 Can you attach more logs to see if there is some entry from ContextCleaner?
  
 I met very similar issue before…but haven’t get resolved  
  
 Best,  
  
 --  
 Nan Zhu
  
  
 On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:
  
  Dear All,  
   
  Not sure if this is a false alarm. But wanted to raise to this to 
  understand what is happening.  
   
  I am testing the Kafka Receiver which I have written 
  (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low 
  level Kafka Consumer implemented custom Receivers for every Kafka topic 
  partitions and pulling data in parallel. Individual streams from all topic 
  partitions are then merged to create Union stream which used for further 
  processing.
   
  The custom Receiver working fine in normal load with no issues. But when I 
  tested this with huge amount of backlog messages from Kafka ( 50 million + 
  messages), I see couple of major issue in Spark Streaming. Wanted to get 
  some opinion on this
   
  I am using latest Spark 1.1 taken from the source and built it. Running in 
  Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode.
   
  Below are two main question I have..
   
  1. What I am seeing when I run the Spark Streaming with my Kafka Consumer 
  with a huge backlog in Kafka ( around 50 Million), Spark is completely busy 
  performing the Receiving task and hardly schedule any processing task. Can 
  you let me if this is expected ? If there is large backlog, Spark will take 
  long time pulling them . Why Spark not doing any processing ? Is it because 
  of resource limitation ( say all cores are busy puling ) or it is by design 
  ? I am setting the executor-memory to 10G and driver-memory to 4G .
   
  2

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Nan Zhu
Hi, Victor, 

the issue with having different versions in the driver and the cluster is that
the master will shut down your application due to an inconsistent
serialVersionUID in ExecutorState

Best, 

-- 
Nan Zhu


On Tuesday, August 26, 2014 at 10:10 PM, Matei Zaharia wrote:

 Things will definitely compile, and apps compiled on 1.0.0 should even be 
 able to link against 1.0.2 without recompiling. The only problem is if you 
 run your driver with 1.0.0 on its classpath, but the cluster has 1.0.2 in 
 executors.
 
 For Mesos and YARN vs standalone, the difference is that they just have more 
 features, at the expense of more complicated setup. For example, they have 
 richer support for cross-application sharing (see 
 https://spark.apache.org/docs/latest/job-scheduling.html), and the ability to 
 run non-Spark applications on the same cluster.
 
 Matei 
 
 On August 26, 2014 at 6:53:33 PM, Victor Tso-Guillen (v...@paxata.com 
 (mailto:v...@paxata.com)) wrote:
 
  Yes, we are standalone right now. Do you have literature why one would want 
  to consider Mesos or YARN for Spark deployments? 
  
  Sounds like I should try upgrading my project and seeing if everything 
  compiles without modification. Then I can connect to an existing 1.0.0 
  cluster and see what what happens... 
  
  Thanks, Matei :) 
  
  
  On Tue, Aug 26, 2014 at 6:37 PM, Matei Zaharia matei.zaha...@gmail.com 
  (mailto:matei.zaha...@gmail.com) wrote:
   Is this a standalone mode cluster? We don't currently make this 
   guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem 
   though is that the standalone mode grabs the executors' version of Spark 
   code from what's installed on the cluster, while your driver might be 
   built against another version. On YARN and Mesos, you can more easily mix 
   different versions of Spark, since each application ships its own Spark 
   JAR (or references one from a URL), and this is used for both the driver 
   and executors. 
   
   Matei 
   
   
   On August 26, 2014 at 6:10:57 PM, Victor Tso-Guillen (v...@paxata.com 
   (mailto:v...@paxata.com)) wrote:
   
I wanted to make sure that there's full compatibility between minor 
releases. I have a project that has a dependency on spark-core so that 
it can be a driver program and that I can test locally. However, when 
connecting to a cluster you don't necessarily know what version you're 
connecting to. Is a 1.0.0 cluster binary compatible with a 1.0.2 driver 
program? Is a 1.0.0 driver program binary compatible with a 1.0.2 
cluster?



   
   
   
   
  
  
  



SELECT DISTINCT generates random results?

2014-08-05 Thread Nan Zhu
Hi, all  

I use “SELECT DISTINCT” to query the data saved in Hive.

It seems that this statement cannot understand the table structure and just
outputs the data in other fields.

Anyone met the similar problem before?

Best,  

--  
Nan Zhu



Re: SELECT DISTINCT generates random results?

2014-08-05 Thread Nan Zhu
nvm,   

it was a problem caused by the ill-formatted raw data

--  
Nan Zhu


On Tuesday, August 5, 2014 at 3:42 PM, Nan Zhu wrote:

 Hi, all  
  
 I use “SELECT DISTINCT” to query the data saved in hive
  
 it seems that this statement cannot understand the table structure and just 
 output the data in other fields
  
 Anyone met the similar problem before?
  
 Best,  
  
 --  
 Nan Zhu
  
  
  
  




DROP IF EXISTS still throws exception about table does not exist?

2014-07-21 Thread Nan Zhu
Hi, all  

When I try hiveContext.hql("drop table if exists abc"), where abc is a
non-existent table,

I still receive an exception about the non-existent table, even though "if exists" is there.

the same statement runs well in hive shell

Some feedback from Hive community is here: 
https://issues.apache.org/jira/browse/HIVE-7458

“Your are doing hiveContext.hql(DROP TABLE IF EXISTS hivetesting) in Scala 
schell of the Spark project.

What this shell is doing ? Query to remote metastore on non existing table (see 
on your provided stack).
The remote metastore throws NoSuchObjectException(message:default.hivetesting 
table not found)because Spark code call 
HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) on non-existing 
table. It's the right behavior.
You should check on Spark code why a query is done on non existing table.


I think Spark does not handle well the IF EXISTS part of this query. Maybe you 
could fill a ticket on Spark JIRA.

BUT, it's not a bug in HIVE IMHO.”

My question is: isn’t the DDL executed by Hive itself?

Best,  

--  
Nan Zhu



broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
scala.collection.immutable.$colon$colon.readObject(List.scala:362) at 
sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
scala.collection.immutable.$colon$colon.readObject(List.scala:362) at 
sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) 
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:744)

I highlighted the lines indicating that the ContextCleaner cleaned the broadcast
variable. I’m wondering why the variable was cleaned, since there is enough
memory space.

Best,  

--  
Nan Zhu



Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Hi, TD,   

Thanks for the reply

I tried to reproduce this in a simpler program, but no luck

However, the program is very simple: it just loads some files from HDFS and
writes them to HBase….

---

It seems that the issue only appears when I run the unit test in Jenkins (it does
not fail every time; usually it succeeds in about 1 out of 10 runs).

I once suspected that it’s related to some concurrency issue, but even if I
disable the parallel tests in build.sbt, the problem is still there.

---

Best,  

--  
Nan Zhu


On Monday, July 21, 2014 at 5:40 PM, Tathagata Das wrote:

 The ContextCleaner cleans up data and metadata related to RDDs and broadcast 
 variables, only when those variables are not in scope and get 
 garbage-collected by the JVM. So if the broadcast variable in question is 
 probably somehow going out of scope even before the job using the broadcast 
 variable is in progress.  
  
 Could you reproduce this behavior reliably in a simple code snippet that you 
 can share with us?
  
 TD
  
  
  
 On Mon, Jul 21, 2014 at 2:29 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, all  
   
  When I run some Spark application (actually unit test of the application in 
  Jenkins ), I found that I always hit the FileNotFoundException when reading 
  broadcast variable   
   
  The program itself works well, except the unit test
   
  Here is the example log:  
   
   
  14/07/21 19:49:13 INFO Executor: Running task ID 4 14/07/21 19:49:13 INFO 
  DAGScheduler: Completed ResultTask(0, 2) 14/07/21 19:49:13 INFO 
  TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 3/106) 
  14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
  hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of 
  result for 3 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 3 
  directly to driver 14/07/21 19:49:13 INFO BlockManager: Found block 
  broadcast_0 locally 14/07/21 19:49:13 INFO Executor: Finished task ID 3 
  14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on 
  executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO 
  TaskSetManager: Serialized task 0.0:5 as 11885 bytes in 0 ms 14/07/21 
  19:49:13 INFO Executor: Running task ID 5 14/07/21 19:49:13 INFO 
  BlockManager: Removing broadcast 0 14/07/21 19:49:13 INFO DAGScheduler: 
  Completed ResultTask(0, 3) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned 
  broadcast 0 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms 
  on localhost (progress: 4/106) 14/07/21 19:49:13 INFO BlockManager: Found 
  block broadcast_0 locally 14/07/21 19:49:13 INFO BlockManager: Removing 
  block broadcast_0 14/07/21 19:49:13 INFO MemoryStore: Block broadcast_0 of 
  size 202564 dropped from memory (free 886623436) 14/07/21 19:49:13 INFO 
  ContextCleaner: Cleaned shuffle 0 14/07/21 19:49:13 INFO 
  ShuffleBlockManager: Deleted all files for shuffle 0 14/07/21 19:49:13 INFO 
  HadoopRDD: Input split: 
  hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5 14/07/21 
  (http://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+514/07/21) 
  19:49:13 INFO HadoopRDD: Input split: 
  hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5 14/07/21 
  (http://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+514/07/21) 
  19:49:13 INFO TableOutputFormat: Created table instance for 
  hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of 
  result for 4 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 4 
  directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 4 
  14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on 
  executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO 
  TaskSetManager: Serialized task 0.0:6 as 11885 bytes in 0 ms 14/07/21 
  19:49:13 INFO Executor: Running task ID 6 14/07/21 19:49:13 INFO 
  DAGScheduler: Completed ResultTask(0, 4) 14/07/21 19:49:13 INFO 
  TaskSetManager: Finished TID 4 in 80 ms on localhost (progress: 5/106) 
  14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
  hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of 
  result for 5 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 5 
  directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 5 
  14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on 
  executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO 
  TaskSetManager: Serialized task 0.0:7 as 11885 bytes in 0 ms 14/07/21 
  19:49:13 INFO Executor: Running task ID 7 14/07/21 19:49:13 INFO 
  DAGScheduler: Completed ResultTask(0, 5) 14/07/21 19:49:13 INFO 
  TaskSetManager: Finished TID 5 in 77 ms on localhost (progress: 6/106) 
  14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0 
  14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0 
  14/07/21 19:49:13 ERROR Executor: Exception in task ID 6

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Hi, TD,   

I think I got more insight into the problem

in the Jenkins test file, I mistakenly passed a wrong value to spark.cores.max,
which is much larger than the expected value

(I passed the master address as local[6], and spark.cores.max as 200)

If I set a more consistent value, everything goes well.

But I would not expect spark.cores.max being too large to cause this problem?

Best,  

--  
Nan Zhu


On Monday, July 21, 2014 at 6:11 PM, Nan Zhu wrote:

 Hi, TD,   
  
 Thanks for the reply
  
 I tried to reproduce this in a simpler program, but no luck
  
 However, the program has been very simple, just load some files from HDFS and 
 write them to HBase….
  
 ---
  
 It seems that the issue only appears when I run the unit test in Jenkins (not 
 fail every time, in usual, it will success in 1/10 times)
  
 I once suspected that it’s related to some concurrency issue, but even I 
 disable the parallel test in built.sbt, the problem is still there
  
 ---
  
 Best,  
  
 --  
 Nan Zhu
  
  
 On Monday, July 21, 2014 at 5:40 PM, Tathagata Das wrote:
  
  The ContextCleaner cleans up data and metadata related to RDDs and 
  broadcast variables, only when those variables are not in scope and get 
  garbage-collected by the JVM. So if the broadcast variable in question is 
  probably somehow going out of scope even before the job using the broadcast 
  variable is in progress.  
   
  Could you reproduce this behavior reliably in a simple code snippet that 
  you can share with us?
   
  TD
   
   
   
  On Mon, Jul 21, 2014 at 2:29 PM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   Hi, all  

   When I run some Spark application (actually unit test of the application 
   in Jenkins ), I found that I always hit the FileNotFoundException when 
   reading broadcast variable   

   The program itself works well, except the unit test

   Here is the example log:  


   14/07/21 19:49:13 INFO Executor: Running task ID 4 14/07/21 19:49:13 INFO 
   DAGScheduler: Completed ResultTask(0, 2) 14/07/21 19:49:13 INFO 
   TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 3/106) 
   14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
   hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of 
   result for 3 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 3 
   directly to driver 14/07/21 19:49:13 INFO BlockManager: Found block 
   broadcast_0 locally 14/07/21 19:49:13 INFO Executor: Finished task ID 3 
   14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on 
   executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO 
   TaskSetManager: Serialized task 0.0:5 as 11885 bytes in 0 ms 14/07/21 
   19:49:13 INFO Executor: Running task ID 5 14/07/21 19:49:13 INFO 
   BlockManager: Removing broadcast 0 14/07/21 19:49:13 INFO DAGScheduler: 
   Completed ResultTask(0, 3) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned 
   broadcast 0 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 
   ms on localhost (progress: 4/106) 14/07/21 19:49:13 INFO BlockManager: 
   Found block broadcast_0 locally 14/07/21 19:49:13 INFO BlockManager: 
   Removing block broadcast_0 14/07/21 19:49:13 INFO MemoryStore: Block 
   broadcast_0 of size 202564 dropped from memory (free 886623436) 14/07/21 
   19:49:13 INFO ContextCleaner: Cleaned shuffle 0 14/07/21 19:49:13 INFO 
   ShuffleBlockManager: Deleted all files for shuffle 0 14/07/21 19:49:13 
   INFO HadoopRDD: Input split: 
   hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5 14/07/21 
   (http://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+514/07/21) 
   19:49:13 INFO HadoopRDD: Input split: 
   hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5 14/07/21 
   (http://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+514/07/21) 
   19:49:13 INFO TableOutputFormat: Created table instance for 
   hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of 
   result for 4 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 4 
   directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 4 
   14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on 
   executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO 
   TaskSetManager: Serialized task 0.0:6 as 11885 bytes in 0 ms 14/07/21 
   19:49:13 INFO Executor: Running task ID 6 14/07/21 19:49:13 INFO 
   DAGScheduler: Completed ResultTask(0, 4) 14/07/21 19:49:13 INFO 
   TaskSetManager: Finished TID 4 in 80 ms on localhost (progress: 5/106) 
   14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
   hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of 
   result for 5 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 5 
   directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 5 
   14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7

Re: DROP IF EXISTS still throws exception about table does not exist?

2014-07-21 Thread Nan Zhu
Ah, I see,   

thanks, Yin  

--  
Nan Zhu


On Monday, July 21, 2014 at 5:00 PM, Yin Huai wrote:

 Hi Nan,
  
 It is basically a log entry because your table does not exist. It is not a 
 real exception.  
  
 Thanks,
  
 Yin
  
  
 On Mon, Jul 21, 2014 at 7:10 AM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  a related JIRA: https://issues.apache.org/jira/browse/SPARK-2605  
   
  --  
  Nan Zhu
   
   
  On Monday, July 21, 2014 at 10:10 AM, Nan Zhu wrote:
   
   Hi, all  

   When I try hiveContext.hql(drop table if exists abc) where abc is a 
   non-exist table  

   I still received an exception about non-exist table though if exists is 
   there

   the same statement runs well in hive shell  

   Some feedback from Hive community is here: 
   https://issues.apache.org/jira/browse/HIVE-7458  

   “Your are doing hiveContext.hql(DROP TABLE IF EXISTS hivetesting) in 
   Scala schell of the Spark project.  

   What this shell is doing ? Query to remote metastore on non existing 
   table (see on your provided stack).
   The remote metastore throws 
   NoSuchObjectException(message:default.hivetesting table not found)because 
   Spark code call 
   HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854) on 
   non-existing table. It's the right behavior.
   You should check on Spark code why a query is done on non existing table.


   I think Spark does not handle well the IF EXISTS part of this query. 
   Maybe you could fill a ticket on Spark JIRA.

   BUT, it's not a bug in HIVE IMHO.”

   My question is the DDL is executed by Hive itself, doesn’t it?

   Best,  

   --  
   Nan Zhu

   
  



Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-07-21 Thread Nan Zhu
Ah, sorry, sorry, my brain is just fried….. I sent some wrong information.

It is not “spark.cores.max” but the minPartitions in sc.textFile().
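
To be concrete, a rough sketch of what the test was (unintentionally) doing (the
input path and app name below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // local[6] gives only 6 cores, while minPartitions was mistakenly set to 200
    val sc = new SparkContext(new SparkConf().setMaster("local[6]").setAppName("test"))
    val file = sc.textFile("hdfs:///some/input", minPartitions = 200)  // placeholder path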


Best,

--  
Nan Zhu


On Monday, July 21, 2014 at 7:17 PM, Tathagata Das wrote:

 That is definitely weird. spark.core.max should not affect thing when they 
 are running local mode.  
  
 And, I am trying to think of scenarios that could cause a broadcast variable 
 used in the current job to fall out of scope, but they all seem very far 
 fetched. So i am really curious to see the code where this could be 
 happening.  
  
 Either ways, you could turn off the behavior by using 
 spark.cleaner.referenceTracking=false
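 
 A minimal sketch of turning that off (assuming you set it on the SparkConf before
 creating the context; the app name is a placeholder):
 
     import org.apache.spark.{SparkConf, SparkContext}
 
     // disable ContextCleaner reference tracking (sketch only)
     val conf = new SparkConf()
       .setAppName("test")  // placeholder
       .set("spark.cleaner.referenceTracking", "false")
     val sc = new SparkContext(conf)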
  
 TD
  
  
 On Mon, Jul 21, 2014 at 3:52 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, TD,   
   
  I think I got more insights to the problem
   
  in the Jenkins test file, I mistakenly pass a wrong value to 
  spark.cores.max, which is much larger than the expected value   
   
  (I passed master address as local[6], and spark.core.max as 200)
   
  If I set a more consistent value, everything goes well,  
   
  But I do not think it will bring this problem even the spark.cores.max is 
  too large?  
   
  Best,  
   
  --  
  Nan Zhu
   
   
  On Monday, July 21, 2014 at 6:11 PM, Nan Zhu wrote:
   
   Hi, TD,   

   Thanks for the reply

   I tried to reproduce this in a simpler program, but no luck  

   However, the program has been very simple, just load some files from HDFS 
   and write them to HBase….

   ---

   It seems that the issue only appears when I run the unit test in Jenkins 
   (not fail every time, in usual, it will success in 1/10 times)  

   I once suspected that it’s related to some concurrency issue, but even I 
   disable the parallel test in built.sbt, the problem is still there  

   ---

   Best,  

   --  
   Nan Zhu


   On Monday, July 21, 2014 at 5:40 PM, Tathagata Das wrote:

The ContextCleaner cleans up data and metadata related to RDDs and 
broadcast variables, only when those variables are not in scope and get 
garbage-collected by the JVM. So if the broadcast variable in question 
is probably somehow going out of scope even before the job using the 
broadcast variable is in progress.  
 
Could you reproduce this behavior reliably in a simple code snippet 
that you can share with us?
 
TD
 
 
 
On Mon, Jul 21, 2014 at 2:29 PM, Nan Zhu zhunanmcg...@gmail.com 
(mailto:zhunanmcg...@gmail.com) wrote:
 Hi, all  
  
 When I run some Spark application (actually unit test of the 
 application in Jenkins ), I found that I always hit the 
 FileNotFoundException when reading broadcast variable   
  
 The program itself works well, except the unit test
  
 Here is the example log:  
  
  
 14/07/21 19:49:13 INFO Executor: Running task ID 4 14/07/21 19:49:13 
 INFO DAGScheduler: Completed ResultTask(0, 2) 14/07/21 19:49:13 INFO 
 TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 
 3/106) 14/07/21 19:49:13 INFO TableOutputFormat: Created table 
 instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: 
 Serialized size of result for 3 is 596 14/07/21 19:49:13 INFO 
 Executor: Sending result for 3 directly to driver 14/07/21 19:49:13 
 INFO BlockManager: Found block broadcast_0 locally 14/07/21 19:49:13 
 INFO Executor: Finished task ID 3 14/07/21 19:49:13 INFO 
 TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: 
 localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: 
 Serialized task 0.0:5 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO 
 Executor: Running task ID 5 14/07/21 19:49:13 INFO BlockManager: 
 Removing broadcast 0 14/07/21 19:49:13 INFO DAGScheduler: Completed 
 ResultTask(0, 3) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned 
 broadcast 0 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 
 97 ms on localhost (progress: 4/106) 14/07/21 19:49:13 INFO 
 BlockManager: Found block broadcast_0 locally 14/07/21 19:49:13 INFO 
 BlockManager: Removing block broadcast_0 14/07/21 19:49:13 INFO 
 MemoryStore: Block broadcast_0 of size 202564 dropped from memory 
 (free 886623436) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned 
 shuffle 0 14/07/21 19:49:13 INFO ShuffleBlockManager: Deleted all 
 files for shuffle 0 14/07/21 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5 14/07/21 
 (http://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+514/07/21)
  19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5 14/07/21 
 (http://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+514/07/21)
  19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers 14/07

try JDBC server

2014-07-11 Thread Nan Zhu
Hi, all 

I would like to give the JDBC server (which is supposed to be released in 1.1)
a try.

where can I find the document about that?

Best, 

-- 
Nan Zhu



Re: try JDBC server

2014-07-11 Thread Nan Zhu
nvm 

for others with the same question:

https://github.com/apache/spark/commit/8032fe2fae3ac40a02c6018c52e76584a14b3438 

-- 
Nan Zhu


On Friday, July 11, 2014 at 7:02 PM, Nan Zhu wrote:

 Hi, all 
 
 I would like to give a try on JDBC server (which is supposed to be released 
 in 1.1)
 
 where can I find the document about that?
 
 Best, 
 
 -- 
 Nan Zhu
 
 
 




Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-07 Thread Nan Zhu
Hey, Cheney,  

Is the problem still there?

Sorry for the delay; I’m starting to look at this issue.

Best,  

--  
Nan Zhu


On Tuesday, May 6, 2014 at 10:06 PM, Cheney Sun wrote:

 Hi Nan,
  
 In worker's log, I see the following exception thrown when try to launch on 
 executor. (The SPARK_HOME is wrongly specified on purpose, so there is no 
 such file /usr/local/spark1/bin/compute-classpath.sh 
 (http://compute-classpath.sh)).  
 After the exception was thrown several times, the worker was requested to 
 kill the executor. Following the killing, the worker try to register again 
 with master, but master reject the registration with WARN message Got 
 heartbeat from unregistered worker worker-20140504140005-host-spark-online001
  
 Looks like the issue wasn't fixed in 0.9.1. Do you know any pull request 
 addressing this issue? Thanks.
  
 java.io.IOException: Cannot run program 
 /usr/local/spark1/bin/compute-classpath.sh (http://compute-classpath.sh) 
 (in directory .): error=2, No such file or directory  
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:600)
 at 
 org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:58)
 at 
 org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:37)
 at 
 org.apache.spark.deploy.worker.ExecutorRunner.getCommandSeq(ExecutorRunner.scala:104)
 at 
 org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:119)
 at 
 org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:59)
 Caused by: java.io.IOException: error=2, No such file or directory
 at java.lang.UNIXProcess.forkAndExec(Native Method)
 at java.lang.UNIXProcess.init(UNIXProcess.java:135)
 at java.lang.ProcessImpl.start(ProcessImpl.java:130)
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
 ... 6 more
 ..
 14/05/04 21:35:45 INFO Worker: Asked to kill executor 
 app-20140504213545-0034/18
 14/05/04 21:35:45 INFO Worker: Executor app-20140504213545-0034/18 finished 
 with state FAILED message class java.io.IOException: Cannot run program 
 /usr/local/spark1/bin/compute-classpath.sh (http://compute-classpath.sh) 
 (in directory .): error=2, No such file or directory
 14/05/04 21:35:45 ERROR OneForOneStrategy: key not found: 
 app-20140504213545-0034/18
 java.util.NoSuchElementException: key not found: app-20140504213545-0034/18
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/04 21:35:45 INFO Worker: Starting Spark worker 
 host-spark-online001:7078 with 10 cores, 28.0 GB RAM
 14/05/04 21:35:45 INFO Worker: Spark home: /usr/local/spark-0.9.1-cdh4.2.0
 14/05/04 21:35:45 INFO WorkerWebUI: Started Worker web UI at 
 http://host-spark-online001:8081
 14/05/04 21:35:45 INFO Worker: Connecting to master 
 spark://host-spark-online001:7077...
 14/05/04 21:35:45 INFO Worker: Successfully registered with master 
 spark://host-spark-online001:7077
  
  



Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind 
the WARNING in the logs

you can set spark.executor.extraJavaOptions in your SparkConf object.
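
For example, a minimal sketch (Spark 1.x Scala API; the app name and the JVM
flags below are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // set executor JVM options through SparkConf instead of SPARK_JAVA_OPTS
    val conf = new SparkConf()
      .setAppName("gc-test")  // placeholder
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseConcMarkSweepGC -XX:MaxPermSize=256m")  // example flags only
    val sc = new SparkContext(conf)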

Best,

--  
Nan Zhu


On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote:

 Hi, Wei
  
 You may try to set JVM opts in spark-env.sh (http://spark-env.sh) as follow 
 to prevent or mitigate GC pause:
  
 export SPARK_JAVA_OPTS=-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC 
 -Xmx2g -XX:MaxPermSize=256m
  
 There are more options you could add, please just Google :)  
  
  
 Regards,
 Wang Hao(王灏)
  
 CloudTeam | School of Software Engineering
 Shanghai Jiao Tong University
 Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
 Email:wh.s...@gmail.com (mailto:wh.s...@gmail.com)
  
  
  
  
  
  
 On Sun, Jun 15, 2014 at 10:24 AM, Wei Tan w...@us.ibm.com 
 (mailto:w...@us.ibm.com) wrote:
  Hi,  
   
I have a single node (192G RAM) stand-alone spark, with memory 
  configuration like this in spark-env.sh (http://spark-env.sh)  
   
  SPARK_WORKER_MEMORY=180g  
  SPARK_MEM=180g  
   
   
   In spark-shell I have a program like this:  
   
  val file = sc.textFile(/localpath) //file size is 40G  
  file.cache()  
   
   
  val output = file.map(line = extract something from line)  
   
  output.saveAsTextFile (...)  
   
   
  When I run this program again and again, or keep trying file.unpersist() 
  -- file.cache() -- output.saveAsTextFile(), the run time varies a lot, 
  from 1 min to 3 min to 50+ min. Whenever the run-time is more than 1 min, 
  from the stage monitoring GUI I observe big GC pause (some can be 10+ min). 
  Of course when run-time is normal, say ~1 min, no significant GC is 
  observed. The behavior seems somewhat random.  
   
  Is there any JVM tuning I should do to prevent this long GC pause from 
  happening?  
   
   
   
  I used java-1.6.0-openjdk.x86_64, and my spark-shell process is something 
  like this:  
   
  root 10994  1.7  0.6 196378000 1361496 pts/51 Sl+ 22:06   0:12 
  /usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -cp 
  ::/home/wtan/scala/spark-1.0.0-bin-hadoop1/conf:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-core-3.2.2.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-rdbms-3.2.1.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar
   -XX:MaxPermSize=128m -Djava.library.path= -Xms180g -Xmx180g 
  org.apache.spark.deploy.SparkSubmit spark-shell --class 
  org.apache.spark.repl.Main  
   
  Best regards,  
  Wei  
   
  -  
  Wei Tan, PhD  
  Research Staff Member  
  IBM T. J. Watson Research Center  
  http://researcher.ibm.com/person/us-wtan



Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
Yes, I think it is listed in the comments in spark-env.sh.template (didn’t
check….)

Best,  

--  
Nan Zhu


On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote:

 Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0?
  
  
  
 On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t 
  mind the WARNING in the logs
   
  you can set spark.executor.extraJavaOpts in your SparkConf obj  
   
  Best,
   
  --  
  Nan Zhu
   
   
  On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote:
   
   Hi, Wei

   You may try to set JVM opts in spark-env.sh (http://spark-env.sh) as 
   follow to prevent or mitigate GC pause:  

   export SPARK_JAVA_OPTS=-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC 
   -Xmx2g -XX:MaxPermSize=256m

   There are more options you could add, please just Google :)  


   Regards,
   Wang Hao(王灏)

   CloudTeam | School of Software Engineering
   Shanghai Jiao Tong University
   Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
   Email:wh.s...@gmail.com (mailto:wh.s...@gmail.com)






   On Sun, Jun 15, 2014 at 10:24 AM, Wei Tan w...@us.ibm.com 
   (mailto:w...@us.ibm.com) wrote:
Hi,  
 
  I have a single node (192G RAM) stand-alone spark, with memory 
configuration like this in spark-env.sh (http://spark-env.sh)  
 
SPARK_WORKER_MEMORY=180g  
SPARK_MEM=180g  
 
 
 In spark-shell I have a program like this:  
 
val file = sc.textFile(/localpath) //file size is 40G  
file.cache()  
 
 
val output = file.map(line = extract something from line)  
 
output.saveAsTextFile (...)  
 
 
When I run this program again and again, or keep trying 
file.unpersist() -- file.cache() -- output.saveAsTextFile(), the run 
time varies a lot, from 1 min to 3 min to 50+ min. Whenever the 
run-time is more than 1 min, from the stage monitoring GUI I observe 
big GC pause (some can be 10+ min). Of course when run-time is 
normal, say ~1 min, no significant GC is observed. The behavior seems 
somewhat random.  
 
Is there any JVM tuning I should do to prevent this long GC pause from 
happening?  
 
 
 
I used java-1.6.0-openjdk.x86_64, and my spark-shell process is 
something like this:  
 
root 10994  1.7  0.6 196378000 1361496 pts/51 Sl+ 22:06   0:12 
/usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -cp 
::/home/wtan/scala/spark-1.0.0-bin-hadoop1/conf:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-core-3.2.2.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-rdbms-3.2.1.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar
 -XX:MaxPermSize=128m -Djava.library.path= -Xms180g -Xmx180g 
org.apache.spark.deploy.SparkSubmit spark-shell --class 
org.apache.spark.repl.Main  
 
Best regards,  
Wei  
 
-  
Wei Tan, PhD  
Research Staff Member  
IBM T. J. Watson Research Center  
http://researcher.ibm.com/person/us-wtan
   
  
  
  
 --  
  SUREN HIRAMAN, 
 VP TECHNOLOGY
 Velos
 Accelerating Machine Learning
  
 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@v (mailto:suren.hira...@sociocast.com)elos.io 
 (http://elos.io)
 W: www.velos.io (http://www.velos.io/)
  



Re: overwriting output directory

2014-06-12 Thread Nan Zhu
Hi, SK 

For 1.0.0, you have to delete it manually.

In 1.0.1 there will be a parameter to enable overwriting:

https://github.com/apache/spark/pull/947/files
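
In the meantime, a minimal sketch of the manual delete (assuming an HDFS output
path and that sc and rdd are already in scope; the path below is a placeholder):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = new Path("hdfs:///tmp/my-output")  // placeholder
    val fs = FileSystem.get(sc.hadoopConfiguration)
    // recursively remove the directory left over from a previous run
    if (fs.exists(outputPath)) fs.delete(outputPath, true)
    rdd.saveAsTextFile(outputPath.toString)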

Best, 

-- 
Nan Zhu


On Thursday, June 12, 2014 at 1:57 PM, SK wrote:

 Hi,
 
 When we have multiple runs of a program writing to the same output file, the
 execution fails if the output directory already exists from a previous run.
 Is there some way we can have it overwrite the existing directory, so that
 we dont have to manually delete it after each run?
 
 Thanks for your help.
 
 
 
 
 




Re: Writing data to HBase using Spark

2014-06-12 Thread Nan Zhu
you are using Spark Streaming?

master = “local[n]” where n > 1?

Best,  

--  
Nan Zhu


On Wednesday, June 11, 2014 at 4:23 AM, gaurav.dasgupta wrote:

 Hi Kanwaldeep,
  
 I have tried your code but arrived into a problem. The code is working fine 
 in local mode. But if I run the same code in Spark stand alone mode or YARN 
 mode, then it is continuously executing, but not saving anything in the HBase 
 table. I guess, it is stopping data streaming once the saveToHBase method is 
 called for the first time.
  
 This is strange. I just want to know whether you have tested the code on all 
 Spark execution modes?
  
 Thanks,
 Gaurav
  
  
 On Tue, Jun 10, 2014 at 12:20 PM, Kanwaldeep [via Apache Spark User List] wrote:
  Please see sample code attached at 
  https://issues.apache.org/jira/browse/SPARK-944.  
   
   

  



Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Nan Zhu
Actually, this has been merged into the master branch:

https://github.com/apache/spark/pull/947 

-- 
Nan Zhu


On Thursday, June 12, 2014 at 2:39 PM, Daniel Siegmann wrote:

 The old behavior (A) was dangerous, so it's good that (B) is now the default. 
 But in some cases I really do want to replace the old data, as per (C). For 
 example, I may rerun a previous computation (perhaps the input data was 
 corrupt and I'm rerunning with good input).
 
 Currently I have to write separate code to remove the files before calling 
 Spark. It would be very convenient if Spark could do this for me. Has anyone 
 created a JIRA issue to support (C)?
 
 
 On Mon, Jun 9, 2014 at 3:02 AM, Aaron Davidson ilike...@gmail.com 
 (mailto:ilike...@gmail.com) wrote:
  It is not a very good idea to save the results in the exact same place as 
  the data. Any failures during the job could lead to corrupted data, because 
  recomputing the lost partitions would involve reading the original 
  (now-nonexistent) data.
  
  As such, the only safe way to do this would be to do as you said, and 
  only delete the input data once the entire output has been successfully 
  created.
  
  
  On Sun, Jun 8, 2014 at 10:32 PM, innowireless TaeYun Kim 
  taeyun@innowireless.co.kr (mailto:taeyun@innowireless.co.kr) 
  wrote:
   Without (C), what is the best practice to implement the following 
   scenario?
   
   1. rdd = sc.textFile(FileA)
   2. rdd = rdd.map(...)  // actually modifying the rdd
   3. rdd.saveAsTextFile(FileA)
   
   Since the rdd transformation is 'lazy', rdd will not materialize until
   saveAsTextFile(), so FileA must still exist, but it must be deleted before
   saveAsTextFile().
   
   What I can think is:
   
   3. rdd.saveAsTextFile(TempFile)
   4. delete FileA
   5. rename TempFile to FileA
   
   This is not very convenient...
   
   Thanks.
   
   -Original Message-
   From: Patrick Wendell [mailto:pwend...@gmail.com]
   Sent: Tuesday, June 03, 2014 11:40 AM
   To: user@spark.apache.org (mailto:user@spark.apache.org)
   Subject: Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing
   file
   
   (A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoo's output
   format check and overwrite files in the destination directory.
   But it won't clobber the directory entirely. I.e. if the directory already
   had part1 part2 part3 part4 and you write a new job outputing only
   two files (part1, part2) then it would leave the other two files 
   intact,
   confusingly.
   
   (B) Semantics in Spark 1.0 and earlier: Runs Hadoop OutputFormat check 
   which
   means the directory must not exist already or an excpetion is thrown.
   
   (C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
   Spark will delete/clobber an existing destination directory if it exists,
   then fully over-write it with new data.
   
   I'm fine to add a flag that allows (B) for backwards-compatibility 
   reasons,
   but my point was I'd prefer not to have (C) even though I see some cases
   where it would be useful.
   
   - Patrick
   
   On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen so...@cloudera.com 
   (mailto:so...@cloudera.com) wrote:
Is there a third way? Unless I miss something. Hadoop's OutputFormat
wants the target dir to not exist no matter what, so it's just a
question of whether Spark deletes it for you or errors.
   
On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell pwend...@gmail.com 
(mailto:pwend...@gmail.com)
   wrote:
We can just add back a flag to make it backwards compatible - it was
just missed during the original PR.
   
Adding a *third* set of clobber semantics, I'm slightly -1 on that
for the following reasons:
   
1. It's scary to have Spark recursively deleting user files, could
easily lead to users deleting data by mistake if they don't
understand the exact semantics.
2. It would introduce a third set of semantics here for saveAsXX...
3. It's trivial for users to implement this with two lines of code
(if output dir exists, delete it) before calling saveAsHadoopFile.
   
- Patrick
   
   
  
 
 
 
 -- 
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning
 
 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io (mailto:daniel.siegm...@velos.io) W: www.velos.io 
 (http://www.velos.io) 



Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Nan Zhu
ah, I see,   

I think it’s hard to do something like fs.delete() in Spark code (it’s scary, as
we discussed in the previous PR).

so if you want (C), I guess you have to do some delete work manually  
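
e.g., a rough sketch of such a clean helper, along the lines of the MyUtil.clean
in your snippet (the Hadoop FileSystem calls are my assumption, and outputDir /
rdd are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object MyUtil {
      // delete the output dir (recursively) if it already exists
      def clean(dir: String, conf: Configuration = new Configuration()): Unit = {
        val path = new Path(dir)
        val fs = FileSystem.get(path.toUri, conf)
        if (fs.exists(path)) fs.delete(path, true)
      }
    }

    // usage: clean first, then save
    MyUtil.clean(outputDir)
    rdd.saveAsTextFile(outputDir)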

Best,  

--  
Nan Zhu


On Thursday, June 12, 2014 at 3:31 PM, Daniel Siegmann wrote:

 I do not want the behavior of (A) - that is dangerous and should only be 
 enabled to account for legacy code. Personally, I think this option should 
 eventually be removed.
  
 I want the option (C), to have Spark delete any existing part files before 
 creating any new output. I don't necessarily want this to be a global option, 
 but one on the API for saveTextFile (i.e. an additional boolean parameter).
  
 As it stands now, I need to precede every saveTextFile call with my own 
 deletion code.
  
 In other words, instead of writing ...
  
 if ( cleanOutput ) { MyUtil.clean(outputDir) }
 rdd.writeTextFile( outputDir )
  
 I'd like to write
  
 rdd.writeTextFile(outputDir, cleanOutput)
  
 Does that make sense?
  
  
  
  
 On Thu, Jun 12, 2014 at 2:51 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Actually this has been merged to the master branch  
   
  https://github.com/apache/spark/pull/947  
   
  --  
  Nan Zhu
   
   
  On Thursday, June 12, 2014 at 2:39 PM, Daniel Siegmann wrote:
   
   The old behavior (A) was dangerous, so it's good that (B) is now the 
   default. But in some cases I really do want to replace the old data, as 
   per (C). For example, I may rerun a previous computation (perhaps the 
   input data was corrupt and I'm rerunning with good input).

   Currently I have to write separate code to remove the files before 
   calling Spark. It would be very convenient if Spark could do this for me. 
   Has anyone created a JIRA issue to support (C)?


   On Mon, Jun 9, 2014 at 3:02 AM, Aaron Davidson ilike...@gmail.com 
   (mailto:ilike...@gmail.com) wrote:
It is not a very good idea to save the results in the exact same place 
as the data. Any failures during the job could lead to corrupted data, 
because recomputing the lost partitions would involve reading the 
original (now-nonexistent) data.
 
As such, the only safe way to do this would be to do as you said, and 
only delete the input data once the entire output has been successfully 
created.
 
 
On Sun, Jun 8, 2014 at 10:32 PM, innowireless TaeYun Kim 
taeyun@innowireless.co.kr (mailto:taeyun@innowireless.co.kr) 
wrote:
 Without (C), what is the best practice to implement the following 
 scenario?
  
 1. rdd = sc.textFile(FileA)
 2. rdd = rdd.map(...)  // actually modifying the rdd
 3. rdd.saveAsTextFile(FileA)
  
 Since the rdd transformation is 'lazy', rdd will not materialize until
 saveAsTextFile(), so FileA must still exist, but it must be deleted 
 before
 saveAsTextFile().
  
 What I can think is:
  
 3. rdd.saveAsTextFile(TempFile)
 4. delete FileA
 5. rename TempFile to FileA
  
 This is not very convenient...
  
 Thanks.
  
 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Tuesday, June 03, 2014 11:40 AM
 To: user@spark.apache.org (mailto:user@spark.apache.org)
 Subject: Re: How can I make Spark 1.0 saveAsTextFile to overwrite 
 existing
 file
  
 (A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoo's 
 output
 format check and overwrite files in the destination directory.
 But it won't clobber the directory entirely. I.e. if the directory 
 already
 had part1 part2 part3 part4 and you write a new job outputing 
 only
 two files (part1, part2) then it would leave the other two files 
 intact,
 confusingly.
  
 (B) Semantics in Spark 1.0 and earlier: Runs Hadoop OutputFormat 
 check which
 means the directory must not exist already or an excpetion is thrown.
  
 (C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
 Spark will delete/clobber an existing destination directory if it 
 exists,
 then fully over-write it with new data.
  
 I'm fine to add a flag that allows (B) for backwards-compatibility 
 reasons,
 but my point was I'd prefer not to have (C) even though I see some 
 cases
 where it would be useful.
  
 - Patrick
  
 On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  Is there a third way? Unless I miss something. Hadoop's OutputFormat
  wants the target dir to not exist no matter what, so it's just a
  question of whether Spark deletes it for you or errors.
 
  On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell 
  pwend...@gmail.com (mailto:pwend...@gmail.com)
 wrote:
  We can just add back a flag to make it backwards compatible

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
Hi, Patrick,   

I think https://issues.apache.org/jira/browse/SPARK-1677 is talking about the 
same thing?

How about assigning it to me?  

I think I missed the configuration part in my previous commit, though I 
declared that in the PR description….

Best,  

--  
Nan Zhu


On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:

 Hey There,
  
 The issue was that the old behavior could cause users to silently
 overwrite data, which is pretty bad, so to be conservative we decided
 to enforce the same checks that Hadoop does.
  
 This was documented by this JIRA:
 https://issues.apache.org/jira/browse/SPARK-1100
 https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
  
 However, it would be very easy to add an option that allows preserving
 the old behavior. Is anyone here interested in contributing that? I
 created a JIRA for it:
  
 https://issues.apache.org/jira/browse/SPARK-1993
  
 - Patrick
  
 On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
 pierre.borckm...@realimpactanalytics.com 
 (mailto:pierre.borckm...@realimpactanalytics.com) wrote:
  Indeed, the behavior has changed for good or for bad. I mean, I agree with
  the danger you mention but I'm not sure it's happening like that. Isn't
  there a mechanism for overwrite in Hadoop that automatically removes part
  files, then writes a _temporary folder and then only the part files along
  with the _success folder.
   
  In any case this change of behavior should be documented IMO.
   
  Cheers
  Pierre
   
  Message sent from a mobile device - excuse typos and abbreviations
   
  Le 2 juin 2014 à 17:42, Nicholas Chammas nicholas.cham...@gmail.com 
  (mailto:nicholas.cham...@gmail.com) a
  écrit :
   
  What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0.) is
  that files get overwritten automatically. This is one danger to this though.
  If I save to a directory that already has 20 part- files, but this time
  around I'm only saving 15 part- files, then there will be 5 leftover part-
  files from the previous set mixed in with the 15 newer files. This is
  potentially dangerous.
   
  I haven't checked to see if this behavior has changed in 1.0.0. Are you
  saying it has, Pierre?
   
  On Mon, Jun 2, 2014 at 9:41 AM, Pierre B
  [pierre.borckm...@realimpactanalytics.com](mailto:pierre.borckm...@realimpactanalytics.com)
  wrote:

   Hi Michaël,

   Thanks for this. We could indeed do that.

   But I guess the question is more about the change of behaviour from 0.9.1
   to
   1.0.0.
   We never had to care about that in previous versions.

   Does that mean we have to manually remove existing files or is there a way
   to aumotically overwrite when using saveAsTextFile?



   --
   View this message in context:
   http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-tp6696p6700.html
   Sent from the Apache Spark User List mailing list archive at Nabble.com 
   (http://Nabble.com).

   
   
  
  
  




Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
I made the PR; the problem is that, after many rounds of review, the configuration 
part was missed…. sorry about that  

I will fix it  

Best,  

--  
Nan Zhu


On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote:

 I'm a bit confused because the PR mentioned by Patrick seems to address all 
 these issues:
 https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
  
 Was it not accepted? Or is the description of this PR not completely 
 implemented?
  
 Message sent from a mobile device - excuse typos and abbreviations
  
 Le 2 juin 2014 à 23:08, Nicholas Chammas nicholas.cham...@gmail.com 
 (mailto:nicholas.cham...@gmail.com) a écrit :
  
  OK, thanks for confirming. Is there something we can do about that leftover 
  part- files problem in Spark, or is that for the Hadoop team?
   
   
  On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com 
  (mailto:ilike...@gmail.com) wrote:
   Yes.


   On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com wrote:
So in summary:
As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
There is an open JIRA issue to add an option to allow clobbering.
Even when clobbering, part- files may be left over from previous saves, 
which is dangerous.
 
Is this correct?
 
 
 
 
On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson ilike...@gmail.com 
wrote:
 +1 please re-add this feature
  
  
 On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com 
 wrote:
  Thanks for pointing that out. I've assigned you to SPARK-1677 (I 
  think
  I accidentally assigned myself way back when I created it). This
  should be an easy fix.
   
  On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com 
  wrote:
   Hi, Patrick,
  
   I think https://issues.apache.org/jira/browse/SPARK-1677 is 
   talking about
   the same thing?
  
   How about assigning it to me?
  
   I think I missed the configuration part in my previous commit, 
   though I
   declared that in the PR description
  
   Best,
  
   --
   Nan Zhu
  
   On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
  
   Hey There,
  
   The issue was that the old behavior could cause users to silently
   overwrite data, which is pretty bad, so to be conservative we 
   decided
   to enforce the same checks that Hadoop does.
  
   This was documented by this JIRA:
   https://issues.apache.org/jira/browse/SPARK-1100
   https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
  
   However, it would be very easy to add an option that allows 
   preserving
   the old behavior. Is anyone here interested in contributing that? 
   I
   created a JIRA for it:
  
   https://issues.apache.org/jira/browse/SPARK-1993
  
   - Patrick
  
   On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
   pierre.borckm...@realimpactanalytics.com wrote:
  
   Indeed, the behavior has changed for good or for bad. I mean, I 
   agree with
   the danger you mention but I'm not sure it's happening like that. 
   Isn't
   there a mechanism for overwrite in Hadoop that automatically 
   removes part
   files, then writes a _temporary folder and then only the part 
   files along
   with the _success folder.
  
   In any case this change of behavior should be documented IMO.
  
   Cheers
   Pierre
  
   Message sent from a mobile device - excuse typos and abbreviations
  
   Le 2 juin 2014 à 17:42, Nicholas Chammas 
   nicholas.cham...@gmail.com a
   écrit :
  
   What I've found using saveAsTextFile() against S3 (prior to Spark 
   1.0.0.) is
   that files get overwritten automatically. This is one danger to 
   this though.
   If I save to a directory that already has 20 part- files, but 
   this time
   around I'm only saving 15 part- files, then there will be 5 
   leftover part-
   files from the previous set mixed in with the 15 newer files. 
   This is
   potentially dangerous.
  
   I haven't checked to see if this behavior has changed in 1.0.0. 
   Are you



Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Nan Zhu
I remember that in an earlier version of that PR, I deleted files by calling 
the HDFS API

we discussed and concluded that it's a bit scary to have something directly 
deleting users' files in Spark

Best,  

--  
Nan Zhu


On Monday, June 2, 2014 at 10:39 PM, Patrick Wendell wrote:

 (A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's
 output format check and overwrite files in the destination directory.
 But it won't clobber the directory entirely. I.e. if the directory
 already had part1, part2, part3, part4 and you write a new job
 outputting only two files (part1, part2) then it would leave the
 other two files intact, confusingly.
  
 (B) Semantics in Spark 1.0 and later: Runs the Hadoop OutputFormat check,
 which means the directory must not exist already or an exception is
 thrown.
  
 (C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
 Spark will delete/clobber an existing destination directory if it
 exists, then fully over-write it with new data.
  
 I'm fine to add a flag that allows (B) for backwards-compatibility
 reasons, but my point was I'd prefer not to have (C) even though I see
 some cases where it would be useful.
  
 - Patrick
  
 On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  Is there a third way? Unless I miss something. Hadoop's OutputFormat
  wants the target dir to not exist no matter what, so it's just a
  question of whether Spark deletes it for you or errors.
   
  On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell pwend...@gmail.com 
  (mailto:pwend...@gmail.com) wrote:
   We can just add back a flag to make it backwards compatible - it was
   just missed during the original PR.

   Adding a *third* set of clobber semantics, I'm slightly -1 on that
   for the following reasons:

   1. It's scary to have Spark recursively deleting user files; it could
   easily lead to users deleting data by mistake if they don't understand
   the exact semantics.
   2. It would introduce a third set of semantics here for saveAsXX...
   3. It's trivial for users to implement this with two lines of code (if
   output dir exists, delete it) before calling saveAsHadoopFile.

   - Patrick  
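
(For illustration, the "two lines of code" approach mentioned above might look roughly like the
following sketch; the output path is a placeholder, sc is assumed to be an existing SparkContext,
e.g. from spark-shell, and the delete is recursive, so use it with care.)

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: clear the output directory ourselves before saving, since Spark 1.0
// refuses to write into an existing directory.
val outputPath = new Path("hdfs://namenode:9000/dataset")      // placeholder path
val fs = FileSystem.get(outputPath.toUri, sc.hadoopConfiguration)
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true)                                  // recursive delete
}
val rdd = sc.parallelize(Seq("a", "b", "c"))                   // stand-in for the real data
rdd.saveAsTextFile(outputPath.toString)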



IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
Hi, all 

I tried to write data to HBase in a Spark 1.0 rc8 application, 

but the application terminates due to a java.lang.IllegalAccessError. The HBase 
shell works fine, and the same application works with a standalone HBase 
deployment.

java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
at 
org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
at 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
at 
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Can anyone give some hint to the issue?

Best, 

-- 
Nan Zhu
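
(For context, the write path in the stack trace (saveAsNewAPIHadoopDataset with TableOutputFormat)
is typically wired up roughly as in the sketch below; the table, column family and column names are
placeholders, and sc is an existing SparkContext. As a side note, this particular IllegalAccessError
is often reported when the hbase-protocol jar, which contains HBaseZeroCopyByteString, is not on the
same classpath as the protobuf classes; that is an assumption about this setup, not a confirmed
diagnosis.)

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._

// Sketch: write (rowKey, value) pairs into an HBase table via TableOutputFormat.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "test_table")            // placeholder table name
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val data = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))         // stand-in data
val puts = data.map { case (rowKey, value) =>
  val put = new Put(Bytes.toBytes(rowKey))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))  // placeholder cf/col
  (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)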



Re: IllegalAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
I tried hbase-0.96.2/0.98.1/0.98.2

HDFS version is 2.3 

-- 
Nan Zhu

On Sunday, May 18, 2014 at 4:18 PM, Nan Zhu wrote: 
 Hi, all 
 
 I tried to write data to HBase in a Spark-1.0 rc8  application, 
 
 the application is terminated due to the java.lang.IllegalAccessError, Hbase 
 shell works fine, and the same application works with a standalone Hbase 
 deployment
 
 java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
 at 
 org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
 at 
 org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
 at 
 org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
 at 
 org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
 at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
 at 
 org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
 at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
 at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
 at 
 org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
 at org.apache.spark.rdd.PairRDDFunctions.org 
 (http://org.apache.spark.rdd.PairRDDFunctions.org)$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 
 
 Can anyone give some hint to the issue?
 
 Best, 
 
 -- 
 Nan Zhu
 



Re: Spark unit testing best practices

2014-05-16 Thread Nan Zhu
+1, at least with current code  

just watch the log printed by DAGScheduler…  

--  
Nan Zhu


On Wednesday, May 14, 2014 at 1:58 PM, Mark Hamstra wrote:

 serDe  



Re: sbt run with spark.ContextCleaner ERROR

2014-05-15 Thread Nan Zhu
same problem here (+1), 

though it does not change the program's result 

-- 
Nan Zhu


On Tuesday, May 6, 2014 at 11:58 PM, Tathagata Das wrote:

 Okay, this needs to be fixed. Thanks for reporting this!
 
 
 
 On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com 
 (mailto:wxh...@gmail.com) wrote:
  Hi, TD
  
  i tried on v1.0.0-rc3 and still got the error
  
  
  
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304p5421.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com 
  (http://Nabble.com).
 



Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-05 Thread Nan Zhu
Ah, I think this should be fixed in 0.9.1?  

Did you see the exception thrown on the worker side?

Best, 

-- 
Nan Zhu


On Sunday, May 4, 2014 at 10:15 PM, Cheney Sun wrote:

 Hi Nan, 
 
 Have you found a way to fix the issue? Now I run into the same problem with
 version 0.9.1.
 
 Thanks,
 Cheney
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5341.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Spark-1.0.0-rc3 compiled against Hadoop 2.3.0 cannot read HDFS 2.3.0?

2014-05-03 Thread Nan Zhu
)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:983)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:256)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:54)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



Anyone met the same issue before?

Best,

-- 
Nan Zhu




spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Hi, all  

I’m writing a Spark application to load S3 data to HDFS,  

the HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0

after I execute

val allfiles = sc.textFile("s3n://abc/*.txt")

val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset")

Spark throws an exception (actually related to Hadoop?):

java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException  
at 
org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:100)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
at $iwC$$iwC$$iwC$$iwC.init(console:14)
at $iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC.init(console:21)
at $iwC.init(console:23)
at init(console:25)
at .init(console:29)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:838)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:750)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:598)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:605)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:608)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:931)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:881)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:973)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.ServiceException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 63 more



Has anyone else met a similar problem?

Best,  

--  
Nan Zhu




Re: spark-0.9.1 compiled with Hadoop 2.3.0 doesn't work with S3?

2014-04-21 Thread Nan Zhu
Yes, I fixed it in the same way, but didn't get a chance to get back here.  

I also made a PR: https://github.com/apache/spark/pull/468

Best,  

--  
Nan Zhu


On Monday, April 21, 2014 at 8:19 PM, Parviz Deyhim wrote:

 I ran into the same issue. The problem seems to be with the jets3t library 
 that Spark uses in project/SparkBuild.scala.  
  
 change this:  
  
 "net.java.dev.jets3t" % "jets3t" % "0.7.1"
  
 to  
  
 "net.java.dev.jets3t" % "jets3t" % "0.9.0"
  
 0.7.1 is not the right version of jets3t for Hadoop 2.3.0
  
  
 On Mon, Apr 21, 2014 at 11:30 AM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, all  
   
  I’m writing a Spark application to load S3 data to HDFS,  
   
  the HDFS version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0
   
  after I execute  
   
  val allfiles = sc.textFile("s3n://abc/*.txt")
   
  val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset")
   
  Spark throws an exception (actually related to Hadoop?):
   
   
  java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException
   
   
  at 
  org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:100)
   
   
  at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:90)
   
   
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
   
   
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
   
   
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
   
   
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
   
   
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
   
   
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   
   
  at 
  org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
   
   
  at 
  org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
   
   
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
   
   
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   
   
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   
   
  at scala.Option.getOrElse(Option.scala:120)
   
   
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   
   
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   
   
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   
   
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   
   
  at scala.Option.getOrElse(Option.scala:120)
   
   
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   
   
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   
   
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   
   
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   
   
  at scala.Option.getOrElse(Option.scala:120)
   
   
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   
   
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
   
   
  at 
  org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
   
   
  at 
  org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
   
   
  at 
  org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
   
   
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
   
   
  at $iwC$$iwC$$iwC$$iwC.init(console:14)
   
   
  at $iwC$$iwC$$iwC.init(console:19)
   
   
  at $iwC$$iwC.init(console:21)
   
   
  at $iwC.init(console:23)
   
   
  at init(console:25)
   
   
  at .init(console:29)
   
   
  at .clinit(console)
   
   
  at .init(console:7)
   
   
  at .clinit(console)
   
   
  at $print(console)
   
   
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   
   
  at 
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   
   
  at 
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   
   
  at java.lang.reflect.Method.invoke(Method.java:606)
   
   
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
   
   
  at 
  org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
   
   
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
   
   
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
   
   
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
   
   
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
   
   
  at 
  org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:838)
   
   
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:750)
   
   
  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:598)
   
   
  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:605

Re: Only TraversableOnce?

2014-04-09 Thread Nan Zhu
Yeah, should be right 

-- 
Nan Zhu


On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote:

 thank you, it works.
 After my operation over p, I return p.toIterator, because mapPartitions has
 an iterator return type; is that right?
 rdd.mapPartitions { D => { val p = D.toArray; ...; p.toIterator } }
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p4043.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
so, the data structure looks like:

D consists of D1, D2, D3 (DX is partition)

and 

DX consists of d1, d2, d3 (dx is the part in your context)?

what you want to do is to transform 

DX to (d1 + d2, d1 + d3, d2 + d3)?



Best, 

-- 
Nan Zhu



On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

 In my application, data parts inside an RDD partition have relations, so I
 need to do some operations between them. 
 
 For example:
 RDD T1 has several partitions, and each partition has three parts A, B and C.
 Then I transform T1 to T2. After the transform, T2 also has three parts D, E and
 F, where D = A+B, E = A+C, F = B+C. As far as I know, Spark only supports
 operations traversing the RDD and calling a function for each element. How
 can I do such a transform?
 
 In Hadoop I copy the data in each partition to a user-defined buffer and do
 any operations I like in the buffer; finally I call output.collect() to emit
 the data. But how can I construct a new RDD with distributed partitions in
 Spark? makeRDD only distributes a local Scala collection to form an RDD.
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
If that's the case, I think mapPartitions is what you need, but it seems that 
you have to load the partition into memory as a whole with toArray

rdd.mapPartitions { D => { val p = D.toArray; ... } }  
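
A slightly fuller sketch of that pattern (a toy example; the pairwise sum stands in for whatever
combination of the parts is actually needed, and sc is the shell's SparkContext):

// Sketch: materialize each partition with toArray, then combine its parts pairwise.
val rdd = sc.parallelize(1 to 12, 3)
val combined = rdd.mapPartitions { iter =>
  val parts = iter.toArray
  val results = for {
    i <- parts.indices
    j <- (i + 1) until parts.length
  } yield parts(i) + parts(j)   // e.g. d1+d2, d1+d3, d2+d3, ...
  results.iterator
}
combined.collect().foreach(println)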

--  
Nan Zhu



On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:

 yes, how can I do this conveniently? I can use filter, but there will be so
 many RDDs and it's not concise
  
  
  
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
  
  




Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nan Zhu
may be unrelated to the question itself, just FYI 

you can run your driver program on a worker node with Spark 0.9

http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster

Best, 

-- 
Nan Zhu



On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas wrote:

 Alright, so I guess I understand now why spark-ec2 allows you to select 
 different instance types for the driver node and worker nodes. If the driver 
 node is just driving and not doing any large collect()s or heavy processing, 
 it can be much smaller than the worker nodes.
 
 With regards to data locality, that may not be an issue in my usage pattern 
 if, in theory, I wanted to make the driver node also do work. I launch 
 clusters using spark-ec2 and source data from S3, so I'm missing out on that 
 data locality benefit from the get-go. The firewall may be an issue if 
 spark-ec2 doesn't punch open the appropriate holes. And it may well not, 
 since it doesn't seem to have an option to configure the driver node to also 
 do work. 
 
 Anyway, I'll definitely leave things the way they are. If I want a beefier 
 cluster, it's probably much easier to just launch a cluster with more slaves 
 using spark-ec2 than it is to set the driver node to a non-default 
 configuration. 
 
 
 On Tue, Apr 8, 2014 at 4:48 PM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  If you want the machine that hosts the driver to also do work, you can
  designate it as a worker too, if I'm not mistaken. I don't think the
  driver should do work, logically, but, that's not to say that the
  machine it's on shouldn't do work.
  --
  Sean Owen | Director, Data Science | London
  
  
  On Tue, Apr 8, 2014 at 8:24 PM, Nicholas Chammas
  nicholas.cham...@gmail.com (mailto:nicholas.cham...@gmail.com) wrote:
   So I have a cluster in EC2 doing some work, and when I take a look here
  
   http://driver-node:4040/executors/
  
   I see that my driver node is snoozing on the job: No tasks, no memory 
   used,
   and no RDD blocks cached.
  
   I'm assuming that it was a conscious design choice not to have the driver
   node partake in the cluster's workload.
  
   Why is that? It seems like a wasted resource.
  
   What's more, the slaves may rise up one day and overthrow the driver out 
   of
   resentment.
  
   Nick
  
  
   
   View this message in context: Why doesn't the driver node do any work?
   Sent from the Apache Spark User List mailing list archive at Nabble.com 
   (http://Nabble.com).
 



Re: Status of MLI?

2014-04-01 Thread Nan Zhu
MLlib has been part of the Spark distribution (under the mllib directory); also check 
http://spark.apache.org/docs/latest/mllib-guide.html 

and for JIRA, because of the recent migration to Apache JIRA, I think all 
MLlib-related issues should be under the Spark umbrella, 
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
 

-- 
Nan Zhu


On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the github 
 repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no 
 activity in the last 30 days 
 (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
  Is the plan to add a lot of this into mllib itself without needing a 
 separate API?
 
 Thanks!
 
 View this message in context: Status of MLI? 
 (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html)
 Sent from the Apache Spark User List mailing list archive 
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
 (http://Nabble.com).



Re: Status of MLI?

2014-04-01 Thread Nan Zhu
Ah, I see, I’m sorry, I didn’t read your email carefully   

then I have no idea about the progress on MLBase

Best,  

--  
Nan Zhu


On Tuesday, April 1, 2014 at 11:05 PM, Krakna H wrote:

 Hi Nan,
  
 I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being 
 actively developed?
  
 I'm familiar with mllib and have been looking at its documentation.
  
 Thanks!
  
  
 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] wrote:
  mllib has been part of Spark distribution (under mllib directory), also 
  check http://spark.apache.org/docs/latest/mllib-guide.html  
   
  and for JIRA, because of the recent migration to apache JIRA, I think all 
  mllib-related issues should be under the Spark umbrella, 
  https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

   
  --  
  Nan Zhu
   
   
  On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:
   
   
   What is the current development status of MLI/MLBase? I see that the 
   github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has 
   had no activity in the last 30 days 
   (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
Is the plan to add a lot of this into mllib itself without needing a 
   separate API?

   Thanks!

   View this message in context: Status of MLI? 
   (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html)
   Sent from the Apache Spark User List mailing list archive 
   (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
   (http://Nabble.com).
   
   
   

  
 View this message in context: Re: Status of MLI? 
 (http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html)
 Sent from the Apache Spark User List mailing list archive 
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
 (http://Nabble.com).



Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
what do you mean by run across the cluster? 

do you want to start the spark-shell across the cluster, or do you want to distribute 
tasks to multiple machines?

in the former case, yes, as long as you indicate the right master URL

in the latter case, also yes, and you can observe the distributed tasks in the Spark 
UI 

-- 
Nan Zhu


On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote:

 Is it possible to run across the cluster using the Spark interactive shell?
 
 To be more explicit, is the procedure similar to running a standalone 
 master-slave Spark cluster?
 
 I want to execute my code in the interactive shell on the master node, and 
 it should run across the cluster [say 5 nodes]. Is the procedure similar?
 
 
 
 
 
 -- 
 Sai Prasanna. AN
 II M.Tech (CS), SSSIHL
 
 Entire water in the ocean can never sink a ship, Unless it gets inside.
 All the pressures of life can never hurt you, Unless you let them in.



Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
all you need to do is ensure your Spark cluster is running well (you can 
check by accessing the Spark UI to see if all workers are displayed)

then, you have to set the correct SPARK_MASTER_IP on the machine where you run 
spark-shell

The more details are :

when you run bin/spark-shell, it will start the driver program in that machine, 
interacting with the Master to start the application (in this case, it is 
spark-shell)

the Master tells Workers to start executors for your application, and the 
executors will try to register with your driver, 

then your driver can distribute tasks to the executors, i.e. run in a 
distributed fashion
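
(As a quick sanity check that the work really is distributed, each task can report the host it ran
on; a rough sketch to paste into spark-shell, assuming sc is the shell's SparkContext:)

// Each task records the hostname of the executor it ran on; seeing more than one
// distinct name means the tasks are being spread across machines.
val hosts = sc.parallelize(1 to 1000, 20)
  .map(_ => java.net.InetAddress.getLocalHost.getHostName)
  .distinct()
  .collect()
println(hosts.mkString(", "))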


Best, 

-- 
Nan Zhu


On Wednesday, March 26, 2014 at 9:01 AM, Sai Prasanna wrote:

 Nan Zhu, it's the latter: I want to distribute the tasks to the cluster 
 [machines available].
 
 If I set SPARK_MASTER_IP on the other machines and set the slave IPs in 
 conf/slaves on the master node, will the interactive shell code run at 
 the master be distributed across multiple machines? 
 
 
  
 
 
 On Wed, Mar 26, 2014 at 6:32 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  what do you mean by run across the cluster? 
  
  you want to start the spark-shell across the cluster or you want to 
  distribute tasks to multiple machines?
  
  if the former case, yes, as long as you indicate the right master URL 
  
  if the later case, also yes, you can observe the distributed task in the 
  Spark UI 
  
  -- 
  Nan Zhu
  
  
  On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote:
  
   Is it possible to run across cluster using Spark Interactive Shell ?
   
   To be more explicit, is the procedure similar to running standalone 
   master-slave spark. 
   
   I want to execute my code in  the interactive shell in the master-node, 
   and it should run across the cluster [say 5 node]. Is the procedure 
   similar ???
   
   
   
   
   
   -- 
   Sai Prasanna. AN
   II M.Tech (CS), SSSIHL
   
   Entire water in the ocean can never sink a ship, Unless it gets inside.
   All the pressures of life can never hurt you, Unless you let them in.
  
 
 
 
 -- 
 Sai Prasanna. AN
 II M.Tech (CS), SSSIHL
 
 Entire water in the ocean can never sink a ship, Unless it gets inside.
 All the pressures of life can never hurt you, Unless you let them in.



Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
the master does more work than that actually; I just explained why he should set 
SPARK_MASTER_IP correctly

a simplified list:

1. maintain the worker status

2. maintain in-cluster driver status

3. maintain executor status (the worker tells the master what happened on the 
executor, ...)



-- 
Nan Zhu



On Wednesday, March 26, 2014 at 9:46 AM, Yana Kadiyska wrote:

 Nan (or anyone who feels they understand the cluster architecture well), can 
 you clarify something for me. 
 
 From reading this user group and your explanation above it appears that the 
 cluster master is only involved in this during application startup -- to 
 allocate executors(from what you wrote sounds like the driver itself passes 
 the job/tasks to  the executors). From there onwards all computation is done 
 on the executors, who communicate results directly to the driver if certain 
 actions (say collect) are performed. Is that right? The only description of 
 the cluster I've seen came from here: 
 https://spark.apache.org/docs/0.9.0/cluster-overview.html but that picture 
 suggests there is no direct communication between driver and executors, which 
 I believe is wrong (unless I am misreading the picture -- I believe Master 
 and Cluster Manager refer to the same thing?). 
 
 The very short form of my question is, does the master do anything other than 
 executor allocation?
 
 
 On Wed, Mar 26, 2014 at 9:23 AM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  what you only need to do is ensure your spark cluster is running well, (you 
  can check by access the Spark UI to see if all workers are displayed)
  
  then, you have to set correct SPARK_MASTER_IP in the machine where you run 
  spark-shell 
  
  The more details are :
  
  when you run bin/spark-shell, it will start the driver program in that 
  machine, interacting with the Master to start the application (in this 
  case, it is spark-shell) 
  
  the Master tells Workers to start executors for your application, and the 
  executors will try to register with your driver, 
  
  then your driver can distribute tasks to the executors, i.e. run in a 
  distributed fashion 
  
  
  Best, 
  
  -- 
  Nan Zhu
  
  
  On Wednesday, March 26, 2014 at 9:01 AM, Sai Prasanna wrote:
  
   Nan Zhu, its the later, I want to distribute the tasks to the cluster 
   [machines available.]
   
   If i set the SPARK_MASTER_IP at the other machines and set the slaves-IP 
   in the /conf/slaves at the master node, will the interactive shell code 
   run at the master get distributed across multiple machines ??? 
   
   

   
   
   On Wed, Mar 26, 2014 at 6:32 PM, Nan Zhu zhunanmcg...@gmail.com 
   (mailto:zhunanmcg...@gmail.com) wrote:
what do you mean by run across the cluster? 

you want to start the spark-shell across the cluster or you want to 
distribute tasks to multiple machines?

if the former case, yes, as long as you indicate the right master URL 

if the later case, also yes, you can observe the distributed task in 
the Spark UI 

-- 
Nan Zhu


On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote:

 Is it possible to run across cluster using Spark Interactive Shell ?
 
 To be more explicit, is the procedure similar to running standalone 
 master-slave spark. 
 
 I want to execute my code in  the interactive shell in the 
 master-node, and it should run across the cluster [say 5 node]. Is 
 the procedure similar ???
 
 
 
 
 
 -- 
 Sai Prasanna. AN
 II M.Tech (CS), SSSIHL
 
 Entire water in the ocean can never sink a ship, Unless it gets 
 inside.
 All the pressures of life can never hurt you, Unless you let them in.

   
   
   
   -- 
   Sai Prasanna. AN
   II M.Tech (CS), SSSIHL
   
   Entire water in the ocean can never sink a ship, Unless it gets inside.
   All the pressures of life can never hurt you, Unless you let them in.
  
 



Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Nan Zhu
and, yes, I think that picture is a bit misleading, though the following 
paragraph does mention that:  

“
Because the driver schedules tasks on the cluster, it should be run close to 
the worker nodes, preferably on the same local area network. If you’d like to 
send requests to the cluster remotely, it’s better to open an RPC to the driver 
and have it submit operations from nearby than to run a driver far away from 
the worker nodes.
”


--  
Nan Zhu


On Wednesday, March 26, 2014 at 9:59 AM, Nan Zhu wrote:

 master does more work than that actually, I just explained why he should set 
 MASTER_IP correctly
  
 a simplified list:
  
 1. maintain the  worker status
  
 2. maintain in-cluster driver status
  
 3. maintain executor status (the worker tells master what happened on the 
 executor,  
  
  
  
 --  
 Nan Zhu
  
  
  
 On Wednesday, March 26, 2014 at 9:46 AM, Yana Kadiyska wrote:
  
  Nan (or anyone who feels they understand the cluster architecture well), 
  can you clarify something for me.  
   
  From reading this user group and your explanation above it appears that the 
  cluster master is only involved in this during application startup -- to 
  allocate executors(from what you wrote sounds like the driver itself passes 
  the job/tasks to  the executors). From there onwards all computation is 
  done on the executors, who communicate results directly to the driver if 
  certain actions (say collect) are performed. Is that right? The only 
  description of the cluster I've seen came from here: 
  https://spark.apache.org/docs/0.9.0/cluster-overview.html but that picture 
  suggests there is no direct communication between driver and executors, 
  which I believe is wrong (unless I am misreading the picture -- I believe 
  Master and Cluster Manager refer to the same thing?).  
   
  The very short form of my question is, does the master do anything other 
  than executor allocation?
   
   
  On Wed, Mar 26, 2014 at 9:23 AM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   what you only need to do is ensure your spark cluster is running well, 
   (you can check by access the Spark UI to see if all workers are displayed)

   then, you have to set correct SPARK_MASTER_IP in the machine where you 
   run spark-shell  

   The more details are :

   when you run bin/spark-shell, it will start the driver program in that 
   machine, interacting with the Master to start the application (in this 
   case, it is spark-shell)  

   the Master tells Workers to start executors for your application, and the 
   executors will try to register with your driver,  

   then your driver can distribute tasks to the executors, i.e. run in a 
   distributed fashion  


   Best,  

   --  
   Nan Zhu


   On Wednesday, March 26, 2014 at 9:01 AM, Sai Prasanna wrote:

Nan Zhu, its the later, I want to distribute the tasks to the cluster 
[machines available.]
 
If i set the SPARK_MASTER_IP at the other machines and set the 
slaves-IP in the /conf/slaves at the master node, will the interactive 
shell code run at the master get distributed across multiple machines 
???  
 
 
  
 
 
On Wed, Mar 26, 2014 at 6:32 PM, Nan Zhu zhunanmcg...@gmail.com 
(mailto:zhunanmcg...@gmail.com) wrote:
 what do you mean by run across the cluster?  
  
 you want to start the spark-shell across the cluster or you want to 
 distribute tasks to multiple machines?
  
 if the former case, yes, as long as you indicate the right master URL 
  
  
 if the later case, also yes, you can observe the distributed task in 
 the Spark UI  
  
 --  
 Nan Zhu
  
  
 On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote:
  
  Is it possible to run across cluster using Spark Interactive Shell ?
   
  To be more explicit, is the procedure similar to running standalone 
  master-slave spark.  
   
  I want to execute my code in  the interactive shell in the 
  master-node, and it should run across the cluster [say 5 node]. Is 
  the procedure similar ???
   
   
   
   
   
  --  
  Sai Prasanna. AN
  II M.Tech (CS), SSSIHL
   
  Entire water in the ocean can never sink a ship, Unless it gets 
  inside.
  All the pressures of life can never hurt you, Unless you let them 
  in.
  
 
 
 
--  
Sai Prasanna. AN
II M.Tech (CS), SSSIHL
 
Entire water in the ocean can never sink a ship, Unless it gets inside.
All the pressures of life can never hurt you, Unless you let them in.

   
  



Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana,   

See my inlined answer  

--  
Nan Zhu



On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:

 Has anyone successfully followed the instructions on the Quick Start page of 
 the Spark home page to run a standalone Scala application?  I can't, and I 
 figure I must be missing something obvious!
  
 I'm trying to follow the instructions here as close to word for word as 
 possible:
 http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
  
 1.  The instructions don't say what directory to create my test application 
 in, but later I'm instructed to run sbt/sbt so I conclude that my working 
 directory must be $SPARK_HOME.  (Temporarily ignoring that it is a little 
 weird to be working directly in the Spark distro.)

You can create your application in any directory, just follow the sbt project 
dir structure
  
 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.  
 Copy-paste in the code from the instructions exactly, replacing 
 YOUR_SPARK_HOME with my spark home path.

should be correct  
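
(For reference, a minimal sketch along the lines of the quick start's SimpleApp; the README path is
a placeholder for YOUR_SPARK_HOME/README.md, and "local" can be replaced with a cluster master URL:)

import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/path/to/spark/README.md"   // placeholder for YOUR_SPARK_HOME/README.md
    val sc = new SparkContext("local", "Simple App")
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}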
  
 3.  Create $SPARK_HOME/mysparktest/simple.sbt.  Copy-paste in the sbt file 
 from the instructions

should be correct   
  
 4.  From the $SPARK_HOME I run sbt/sbt package.  It runs through the ENTIRE 
 Spark project!  This takes several minutes, and at the end, it says Done 
 packaging.  unfortunately, there's nothing in the $SPARK_HOME/mysparktest/ 
 folder other than what I already had there.   

because you are in the Spark directory, you don't actually need to do that; the 
dependency on Spark is resolved by sbt
  
  
 (Just for fun, I also did what I thought was more logical, which is to set my 
 working directory to $SPARK_HOME/mysparktest and run $SPARK_HOME/sbt/sbt 
 package, but that was even less successful: I got an error:  
 awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for 
 reading (No such file or directory)
 Attempting to fetch sbt
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
 directory
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
 directory
 Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please 
 install sbt manually from http://www.scala-sbt.org/
  
  
  
 So, help?  I'm sure these instructions work because people are following them 
 every day, but I can't tell what they are supposed to do.   
  
 Thanks!  
 Diana
  



Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana,   

You don't need to use the Spark-distributed sbt

just download sbt from its official website and set your PATH to the right place

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote:

 Yeah, that's exactly what I did. Unfortunately it doesn't work:
  
 $SPARK_HOME/sbt/sbt package
 awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for 
 reading (No such file or directory)
 Attempting to fetch sbt
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
 directory
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or 
 directory
 Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please 
 install sbt manually from http://www.scala-sbt.org/
  
  
  
  
 On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski og...@plainvanillagames.com 
 (mailto:og...@plainvanillagames.com) wrote:
  You can use any sbt on your machine, including the one that comes with 
  spark. For example, try:
   
  ~/path_to_spark/sbt/sbt compile
  ~/path_to_spark/sbt/sbt run arguments
   
  Or you can just add that to your PATH by:
   
  export PATH=$PATH:~/path_to_spark/sbt
   
  To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile 
  or ??? depending on the system you are using. If you are on Windows, sorry, 
  I can't offer any help there ;)
   
  Ognen
   
   
  On 3/24/14, 3:16 PM, Diana Carroll wrote:
   Thanks Ognen.  

   Unfortunately I'm not able to follow your instructions either.  In 
   particular:
 
sbt compile
sbt run arguments if any

   This doesn't work for me because there's no program on my path called 
   sbt.  The instructions in the Quick Start guide are specific that I 
   should call $SPARK_HOME/sbt/sbt.  I don't have any other executable on 
   my system called sbt.  

   Did you download and install sbt separately?  In following the Quick 
   Start guide, that was not stated as a requirement, and I'm trying to run 
   through the guide word for word.  

   Diana  


   On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski 
   og...@plainvanillagames.com (mailto:og...@plainvanillagames.com) wrote:
Diana,
 
Anywhere on the filesystem you have read/write access (you need not be 
in your spark home directory):
 
mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .
 
Make sure that myproject.sbt has the following in it:
 
name := "I NEED A NAME!"

version := "I NEED A VERSION!"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"

If you will be using Hadoop/HDFS functionality you will need the below 
line also

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
 
The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are 
using 0.8.1 - adjust appropriately.
 
That's it. Now you can do
 
sbt compile
sbt run arguments if any
 
You can also do
sbt package to produce a jar file of your code which you can then add 
to the SparkContext at runtime.
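
(Concretely, that last step might look like the line below; the jar name is a placeholder for
whatever sbt package produces from your name/version settings, and sc is an existing SparkContext:)

sc.addJar("target/scala-2.10/simple-project_2.10-1.0.jar")   // placeholder jar name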
 
In a more complicated project you may need to have a bit more involved 
hierarchy like com.github.dianacarroll which will then translate to 
src/main/scala/com/github/dianacarroll/ where you can put your multiple 
.scala files which will then have to be a part of a package 
com.github.dianacarroll (you can just put that as your first line in 
each of these scala files). I am new to Java/Scala so this is how I do 
it. More educated Java/Scala programmers may tell you otherwise ;)
 
You can get more complicated with the sbt project subrirectory but you 
can read independently about sbt and what it can do, above is the bare 
minimum.
 
Let me know if that helped.
Ognen  
 
 
On 3/24/14, 2:44 PM, Diana Carroll wrote:
 Has anyone successfully followed the instructions on the Quick Start 
 page of the Spark home page to run a standalone Scala application?  
 I can't, and I figure I must be missing something obvious!
  
 I'm trying to follow the instructions here as close to word for 
 word as possible:
 http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
  
 1.  The instructions don't say what directory to create my test 
 application in, but later I'm instructed to run sbt/sbt so I 
 conclude that my working directory must be $SPARK_HOME.  (Temporarily 
 ignoring that it is a little weird to be working directly in the 
 Spark distro.)
  
 2.  Create $SPARK_HOME/mysparktest/src/main/scala/SimpleApp.scala.  
 Copypaste in the code from the instructions exactly, replacing 
 YOUR_SPARK_HOME with my spark home path.
  
 3

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
I realize that I never read the document carefully, and I never found that the Spark 
documentation suggests you use the Spark-distributed sbt……  

Best,

--  
Nan Zhu



On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:

 Thanks for your help, everyone.  Several folks have explained that I can 
 surely solve the problem by installing sbt.
  
 But I'm trying to get the instructions working as written on the Spark 
 website.  The instructions not only don't have you install sbt 
 separately...they actually specifically have you use the sbt that is 
 distributed with Spark.  
  
 If it is not possible to build your own Spark programs with Spark-distributed 
 sbt, then that's a big hole in the Spark docs that I shall file.  And if the 
 sbt that is included with Spark is MEANT to be able to compile your own Spark 
 apps, then that's a product bug.  
  
 But before I file the bug, I'm still hoping I'm missing something, and 
 someone will point out that I'm missing a small step that will make the Spark 
 distribution of sbt work!
  
 Diana
  
  
  
 On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska yana.kadiy...@gmail.com 
 (mailto:yana.kadiy...@gmail.com) wrote:
  Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
  (since like other folks I had sbt preinstalled on my usual machine)
   
  I ran the command exactly as Ognen suggested and see
  Set current project to Simple Project (do you see this -- you should
  at least be seeing this)
  and then a bunch of Resolving ...
   
  messages. I did get an error there, saying it can't find
  javax.servlet.orbit. I googled the error and found this thread:
   
  http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
   
  adding the IvyXML fragment they suggested helped in my case (but
  again, the build pretty clearly complained).
   
  If you're still having no luck, I suggest installing sbt and setting
  SBT_HOME... http://www.scala-sbt.org/
   
  In either case though, it's not a Spark-specific issue...Hopefully
  some of all this helps.
   
  On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll dcarr...@cloudera.com 
  (mailto:dcarr...@cloudera.com) wrote:
   Yeah, that's exactly what I did. Unfortunately it doesn't work:
  
   $SPARK_HOME/sbt/sbt package
   awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
   reading (No such file or directory)
   Attempting to fetch sbt
   /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
   directory
   /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
   directory
   Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. Please
   install sbt manually from http://www.scala-sbt.org/
  
  
  
   On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
   og...@plainvanillagames.com (mailto:og...@plainvanillagames.com) wrote:
  
   You can use any sbt on your machine, including the one that comes with
   spark. For example, try:
  
   ~/path_to_spark/sbt/sbt compile
   ~/path_to_spark/sbt/sbt run arguments
  
   Or you can just add that to your PATH by:
  
    export PATH=$PATH:~/path_to_spark/sbt
  
   To make it permanent, you can add it to your ~/.bashrc or ~/.bash_profile
   or ??? depending on the system you are using. If you are on Windows, 
   sorry,
   I can't offer any help there ;)
  
   Ognen
  
  
   On 3/24/14, 3:16 PM, Diana Carroll wrote:
  
   Thanks Ongen.
  
   Unfortunately I'm not able to follow your instructions either.  In
   particular:
  
  
   sbt compile
   sbt run arguments if any
  
  
   This doesn't work for me because there's no program on my path called
   sbt.  The instructions in the Quick Start guide are specific that I 
   should
   call $SPARK_HOME/sbt/sbt.  I don't have any other executable on my 
   system
   called sbt.
  
   Did you download and install sbt separately?  In following the Quick 
   Start
   guide, that was not stated as a requirement, and I'm trying to run 
   through
   the guide word for word.
  
   Diana
  
  
    On Mon, Mar 24, 2014 at 4:12 PM, Ognen Duzlevski og...@plainvanillagames.com wrote:
  
   Diana,
  
   Anywhere on the filesystem you have read/write access (you need not be 
   in
   your spark home directory):
  
   mkdir myproject
   cd myproject
   mkdir project
   mkdir target
   mkdir -p src/main/scala
   cp $mypath/$mymysource.scala src/main/scala/
   cp $mypath/myproject.sbt .
  
   Make sure that myproject.sbt has the following in it:
  
    name := "I NEED A NAME!"

    version := "I NEED A VERSION!"

    scalaVersion := "2.10.3"

    libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"

    If you will be using Hadoop/HDFS functionality you will need the below
    line also

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
  
   The above assumes you are using Spark 0.9 and Scala 2.10.3. If you are
   using 0.8.1
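
    For illustration, a rough sketch of a minimal application that could go into
    src/main/scala/ and build against a .sbt file like the one above. The object
    name, master URL, and input path are only placeholders, not anything from this
    thread; run sbt package from the project root to build it:

    import org.apache.spark.SparkContext

    object MyApp {
      def main(args: Array[String]): Unit = {
        // "local" master and the input path are placeholders -- adjust for your setup
        val sc = new SparkContext("local", "My App")
        val lines = sc.textFile("/path/to/some/file.txt")
        println("line count: " + lines.count())
        sc.stop()
      }
    }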

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
partition your input so that each partition holds an even number of elements  

use mapPartitions to operate on the Iterator[Int] of each partition

maybe there is a more efficient way….
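
For instance, a rough sketch of the idea (untested; pairwiseSums is just an
illustrative name, and it assumes every partition ends up with an even number
of elements):

import org.apache.spark.rdd.RDD

// sum each adjacent pair within a partition;
// a trailing odd element (if any) is silently dropped by the collect
def pairwiseSums(nums: RDD[Int]): RDD[Int] =
  nums.mapPartitions(_.grouped(2).collect { case Seq(x, y) => x + y })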

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:

 Hi, I have a large data set of numbers (an RDD) and want to perform a 
 computation on groups of two values at a time. For example, 1,2,3,4,5,6,7... 
 is an RDD; can I group it into (1,2),(3,4),(5,6)... and perform the respective 
 computations in an efficient manner? Since we don't have a way to index 
 elements directly with a for loop (i, i+1), is there a way to solve this 
 problem? Please suggest something; I would be really thankful to you.  
 View this message in context: Splitting RDD and Grouping together to perform 
 computation 
 (http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153.html)
 Sent from the Apache Spark User List mailing list archive 
 (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
 (http://Nabble.com).



Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Yes, actually even for Spark I mostly use the sbt I installed myself… so I 
always miss this issue….

If you can reproduce the problem with the Spark-distributed sbt, I suggest 
proposing a PR to fix the document before 0.9.1 is officially released.  

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 8:34 PM, Diana Carroll wrote:

 It is suggested implicitly by giving you the command ./sbt/sbt. A 
 separately installed sbt isn't in a folder called sbt, whereas Spark's 
 version is.  And more relevantly, just a few paragraphs earlier in the 
 tutorial you execute the command sbt/sbt assembly, which definitely refers 
 to the Spark install.  
  
 On Monday, March 24, 2014, Nan Zhu zhunanmcg...@gmail.com wrote:
  I found that I had never read the document carefully, and I never found the 
  Spark documentation suggesting that you use the Spark-distributed sbt……  
   
  Best,
   
  --  
  Nan Zhu
   
   
   
  On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:
   
   Thanks for your help, everyone.  Several folks have explained that I can 
   surely solve the problem by installing sbt.

   But I'm trying to get the instructions working as written on the Spark 
   website.  The instructions not only don't have you install sbt 
   separately...they actually specifically have you use the sbt that is 
   distributed with Spark.  

   If it is not possible to build your own Spark programs with 
   Spark-distributed sbt, then that's a big hole in the Spark docs that I 
   shall file.  And if the sbt that is included with Spark is MEANT to be 
   able to compile your own Spark apps, then that's a product bug.  

   But before I file the bug, I'm still hoping I'm missing something, and 
   someone will point out that I'm missing a small step that will make the 
   Spark distribution of sbt work!

   Diana



   On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska yana.kadiy...@gmail.com 
   wrote:
Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
(since like other folks I had sbt preinstalled on my usual machine)
 
I ran the command exactly as Ognen suggested and see
Set current project to Simple Project (do you see this -- you should
at least be seeing this)
and then a bunch of Resolving ...
 
messages. I did get an error there, saying it can't find
javax.servlet.orbit. I googled the error and found this thread:
 
http://mail-archives.apache.org/mod_mbox/spark-user/201309.mbox/%3ccajbo4nexyzqe6zgreqjtzzz5zrcoavfen+wmbyced6n1epf...@mail.gmail.com%3E
 
adding the IvyXML fragment they suggested helped in my case (but
again, the build pretty clearly complained).
 
If you're still having no luck, I suggest installing sbt and setting
SBT_HOME... http://www.scala-sbt.org/
 
In either case though, it's not a Spark-specific issue...Hopefully
some of all this helps.
 
On Mon, Mar 24, 2014 at 4:30 PM, Diana Carroll dcarr...@cloudera.com 
wrote:
 Yeah, that's exactly what I did. Unfortunately it doesn't work:

 $SPARK_HOME/sbt/sbt package
 awk: cmd. line:1: fatal: cannot open file 
 `./project/build.properties' for
 reading (No such file or directory)
 Attempting to fetch sbt
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
 directory
 /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
 directory
 Our attempt to download sbt locally to sbt/sbt-launch-.jar failed. 
 Please
 install sbt manually from http://www.scala-sbt.org/



 On Mon, Mar 24, 2014 at 4:25 PM, Ognen Duzlevski
 og...@plainvanillagames.com wrote:

 You can use any sbt on your machine, including the one that comes 
 with
 spark. For example, try:

 ~/path_to_spark/sbt/sbt compile
 ~/path_to_spark/sbt/sbt run arguments

 Or you can just add that to your PATH by:

 export PATH=$PATH:~/path_to_spark/sbt

 To make it permanent, you can add it to your ~/.bashrc or 
 ~/.bash_profile
 or ??? depending on the system you are using. If you are on Windows, 
 sorry,
 I can't offer any help there ;)

 Ognen


 On 3/24/14, 3:16 PM, Diana Carroll wrote:

 Thanks, Ognen.

 Unfortunately I'm not able to follow your instructions either.  In
 particular:


 sbt compile
 sbt run arguments if any


 This doesn't work for me because there's no program on my path called
 sbt.  The instructions in the Quick Start guide are specific that 
 I sho  



Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
I didn’t group the integers, but processed them in groups of two.  

First, partition the input:

scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12


Then process each partition, handling its elements in groups of two:

scala> import scala.collection.mutable.ListBuffer
import scala.collection.mutable.ListBuffer

scala> a.mapPartitions(p => { val l = p.toList
     |   val ret = new ListBuffer[Int]
     |   for (i <- 0 until l.length by 2) {
     |     ret += l(i) + l(i + 1)
     |   }
     |   ret.toList.iterator
     | })
res7: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at mapPartitions at <console>:16



scala> res7.collect

res10: Array[Int] = Array(3, 7)
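
One thing to note: this relies on every partition holding an even number of
elements; a pair that straddles a partition boundary is never formed, and a
trailing odd element would make l(i + 1) throw. As a rough sketch (not part of
the session above), you could sanity-check that assumption first:

// glom() turns each partition into an Array so its size can be inspected;
// this is only a check and costs an extra pass over the data
val evenSized = a.glom().map(_.length % 2 == 0).reduce(_ && _)
require(evenSized, "some partition has an odd number of elements")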

Best,

--  
Nan Zhu



On Monday, March 24, 2014 at 8:40 PM, Nan Zhu wrote:

 partition your input so that each partition holds an even number of elements  
  
 use mapPartitions to operate on the Iterator[Int] of each partition
  
 maybe there is a more efficient way….
  
 Best,  
  
 --  
 Nan Zhu
  
  
  
 On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:
  
  Hi, I have a large data set of numbers (an RDD) and want to perform a 
  computation on groups of two values at a time. For example, 1,2,3,4,5,6,7... 
  is an RDD; can I group it into (1,2),(3,4),(5,6)... and perform the 
  respective computations in an efficient manner? Since we don't have a way to 
  index elements directly with a for loop (i, i+1), is there a way to solve 
  this problem? Please suggest something; I would be really thankful to you.  
  View this message in context: Splitting RDD and Grouping together to 
  perform computation 
  (http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-and-Grouping-together-to-perform-computation-tp3153.html)
  Sent from the Apache Spark User List mailing list archive 
  (http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com 
  (http://Nabble.com).