Re: spark pi example fail on yarn

2016-10-21 Thread Xi Shen
I see, I had this issue before. You are using Java 8, right? The Java 8 JVM
reserves more virtual memory at startup, which trips YARN's memory check.

Turning off the memory check is an unsafe way to avoid this issue. It is
better to increase the virtual-to-physical memory ratio, like this:

  
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>3.15</value>
  </property>
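
Another option, sketched roughly below and only as an illustration (the
overhead values are made up, not a tested fix for this exact failure), is to
ask YARN for larger containers from the Spark side; the virtual-memory ceiling
is the ratio multiplied by the container size, so bigger containers also get a
higher ceiling:

// A hedged sketch: enlarge the containers Spark requests so the vmem ceiling
// (vmem-pmem-ratio * container size) rises with them. Values are illustrative.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkPi")
  .set("spark.yarn.am.memoryOverhead", "1024")        // AM container in yarn-client mode, MB
  .set("spark.yarn.executor.memoryOverhead", "1024")  // executor containers, MB
val sc = new SparkContext(conf)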
  


On Fri, Oct 21, 2016 at 11:15 AM Li Li <fancye...@gmail.com> wrote:

I set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml, and it
now works for both yarn-client and spark-shell.

On Fri, Oct 21, 2016 at 10:59 AM, Li Li <fancye...@gmail.com> wrote:
> I found a warning in the NodeManager log. Has the virtual memory limit been
> exceeded? How should I configure YARN to solve this problem?
>
> 2016-10-21 10:41:12,588 INFO
>
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 20299 for container-id
> container_1477017445921_0001_02_01: 335.1 MB of 1 GB physical
> memory used; 2.2 GB of 2.1 GB virtual memory used
> 2016-10-21 10:41:12,589 WARN
>
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Process tree for container: container_1477017445921_0001_02_01 has
> processes older than 1 iteration running over the configured limit.
> Limit=2254857728, current usage = 2338873344
>
> On Fri, Oct 21, 2016 at 8:49 AM, Saisai Shao <sai.sai.s...@gmail.com>
wrote:
>> It is not that Spark has difficulty communicating with YARN; it simply means
>> the AM exited with the FINISHED state.
>>
>> I'm guessing it might be related to the container's memory constraints;
>> please check the YARN RM and NM logs for more details.
>>
>> Thanks
>> Saisai
>>
>> On Fri, Oct 21, 2016 at 8:14 AM, Xi Shen <davidshe...@gmail.com> wrote:
>>>
>>> 16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
>>> application has already exited with state FINISHED!
>>>
>>> From this, I think Spark is having difficulty communicating with YARN. You
>>> should check your Spark log.
>>>
>>>
>>> On Fri, Oct 21, 2016 at 8:06 AM Li Li <fancye...@gmail.com> wrote:
>>>>
>>>> which log file should I check?
>>>>
>>>> On Thu, Oct 20, 2016 at 10:02 PM, Saisai Shao <sai.sai.s...@gmail.com>
>>>> wrote:
>>>> > Looks like ApplicationMaster is killed by SIGTERM.
>>>> >
>>>> > 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
>>>> > 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status:
>>>> >
>>>> > This container may have been killed by the YARN NodeManager or another
>>>> > process; you'd better check the YARN logs for more details.
>>>> >
>>>> > Thanks
>>>> > Saisai
>>>> >
>>>> > On Thu, Oct 20, 2016 at 6:51 PM, Li Li <fancye...@gmail.com> wrote:
>>>> >>
>>>> >> I am setting up a small yarn/spark cluster. The hadoop/yarn version is
>>>> >> 2.7.3, and I can run the wordcount map-reduce job correctly on yarn.
>>>> >> I am using spark-2.0.1-bin-hadoop2.7, submitted with this command:
>>>> >> ~/spark-2.0.1-bin-hadoop2.7$ ./bin/spark-submit --class
>>>> >> org.apache.spark.examples.SparkPi --master yarn-client
>>>> >> examples/jars/spark-examples_2.11-2.0.1.jar 1
>>>> >> it fails and the first error is:
>>>> >> 16/10/20 18:12:03 INFO storage.BlockManagerMaster: Registered
>>>> >> BlockManager BlockManagerId(driver, 10.161.219.189, 39161)
>>>> >> 16/10/20 18:12:03 INFO handler.ContextHandler: Started
>>>> >> o.s.j.s.ServletContextHandler@76ad6715{/metrics/json,null,AVAILABLE}
>>>> >> 16/10/20 18:12:12 INFO
>>>> >> cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster
>>>> >> registered as NettyRpcEndpointRef(null)
>>>> >> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend: Add WebUI
>>>> >> Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,
>>>> >> Map(PROXY_HOSTS -> ai-hz1-spark1, PROXY_URI_BASES ->
>>>> >> http://ai-hz1-spark1:8088/proxy/application_1476957324184_0002),
>>>> >> /proxy/application_1476957324184_0002
>>>> >> 16/10/20 18:12:12 INFO ui.JettyUtils: Adding filter:
>>>> >> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>>> >> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend:
>>>> >>

Re: spark pi example fail on yarn

2016-10-20 Thread Xi Shen
16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
application has already exited with state FINISHED!

From this, I think Spark is having difficulty communicating with YARN. You
should check your Spark log.


On Fri, Oct 21, 2016 at 8:06 AM Li Li  wrote:

which log file should I check?

On Thu, Oct 20, 2016 at 10:02 PM, Saisai Shao 
wrote:
> Looks like ApplicationMaster is killed by SIGTERM.
>
> 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
> 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status:
>
> This container may have been killed by the YARN NodeManager or another process;
> you'd better check the YARN logs for more details.
>
> Thanks
> Saisai
>
> On Thu, Oct 20, 2016 at 6:51 PM, Li Li  wrote:
>>
>> I am setting up a small yarn/spark cluster. The hadoop/yarn version is
>> 2.7.3, and I can run the wordcount map-reduce job correctly on yarn.
>> I am using spark-2.0.1-bin-hadoop2.7, submitted with this command:
>> ~/spark-2.0.1-bin-hadoop2.7$ ./bin/spark-submit --class
>> org.apache.spark.examples.SparkPi --master yarn-client
>> examples/jars/spark-examples_2.11-2.0.1.jar 1
>> it fails and the first error is:
>> 16/10/20 18:12:03 INFO storage.BlockManagerMaster: Registered
>> BlockManager BlockManagerId(driver, 10.161.219.189, 39161)
>> 16/10/20 18:12:03 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@76ad6715{/metrics/json,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO
>> cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster
>> registered as NettyRpcEndpointRef(null)
>> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend: Add WebUI
>> Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,
>> Map(PROXY_HOSTS -> ai-hz1-spark1, PROXY_URI_BASES ->
>> http://ai-hz1-spark1:8088/proxy/application_1476957324184_0002),
>> /proxy/application_1476957324184_0002
>> 16/10/20 18:12:12 INFO ui.JettyUtils: Adding filter:
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>> 16/10/20 18:12:12 INFO cluster.YarnClientSchedulerBackend:
>> SchedulerBackend is ready for scheduling beginning after waiting
>> maxRegisteredResourcesWaitingTime: 3(ms)
>> 16/10/20 18:12:12 WARN spark.SparkContext: Use an existing
>> SparkContext, some configuration may not take effect.
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@489091bd{/SQL,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@1de9b505{/SQL/json,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@378f002a{/SQL/execution,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@2cc75074
{/SQL/execution/json,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO handler.ContextHandler: Started
>> o.s.j.s.ServletContextHandler@2d64160c{/static/sql,null,AVAILABLE}
>> 16/10/20 18:12:12 INFO internal.SharedState: Warehouse path is
>> '/home/hadoop/spark-2.0.1-bin-hadoop2.7/spark-warehouse'.
>> 16/10/20 18:12:13 INFO spark.SparkContext: Starting job: reduce at
>> SparkPi.scala:38
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Got job 0 (reduce at
>> SparkPi.scala:38) with 1 output partitions
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Final stage:
>> ResultStage 0 (reduce at SparkPi.scala:38)
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Missing parents: List()
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Submitting ResultStage
>> 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no
>> missing parents
>> 16/10/20 18:12:13 INFO memory.MemoryStore: Block broadcast_0 stored as
>> values in memory (estimated size 1832.0 B, free 366.3 MB)
>> 16/10/20 18:12:13 INFO memory.MemoryStore: Block broadcast_0_piece0
>> stored as bytes in memory (estimated size 1169.0 B, free 366.3 MB)
>> 16/10/20 18:12:13 INFO storage.BlockManagerInfo: Added
>> broadcast_0_piece0 in memory on 10.161.219.189:39161 (size: 1169.0 B,
>> free: 366.3 MB)
>> 16/10/20 18:12:13 INFO spark.SparkContext: Created broadcast 0 from
>> broadcast at DAGScheduler.scala:1012
>> 16/10/20 18:12:13 INFO scheduler.DAGScheduler: Submitting 1
>> missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at
>> SparkPi.scala:34)
>> 16/10/20 18:12:13 INFO cluster.YarnScheduler: Adding task set 0.0 with
>> 1 tasks
>> 16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
>> application has already exited with state FINISHED!
>> 16/10/20 18:12:14 INFO server.ServerConnector: Stopped
>> ServerConnector@389adf1d{HTTP/1.1}{0.0.0.0:4040}
>> 16/10/20 18:12:14 INFO handler.ContextHandler: Stopped
>> o.s.j.s.ServletContextHandler@841e575
{/stages/stage/kill,null,UNAVAILABLE}
>> 16/10/20 18:12:14 INFO handler.ContextHandler: Stopped
>> 

Re: Spark Random Forest training cost same time on yarn as on standalone

2016-10-20 Thread Xi Shen
If you are running locally, I do not see the point of starting 32 executors
with 2 cores each.

Also, you can check the Spark web console to find out where the time is spent.

Also, you may want to read
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
.


On Thu, Oct 20, 2016 at 6:21 PM 陈哲  wrote:

> I'm training random forest model using spark2.0 on yarn with cmd like:
> $SPARK_HOME/bin/spark-submit \
>   --class com.netease.risk.prediction.HelpMain --master yarn
> --deploy-mode client --driver-cores 1 --num-executors 32 --executor-cores 2 
> --driver-memory
> 10g --executor-memory 6g \
>   --conf spark.rpc.askTimeout=3000 --conf spark.rpc.lookupTimeout=3000
> --conf spark.rpc.message.maxSize=2000  --conf spark.driver.maxResultSize=0
> \
> 
>
> the training process cost almost 8 hours
>
> And I tried training model on local machine with master(local[4]) , the
> whole process still cost 8 - 9 hours.
>
> My question is: why doesn't running on yarn save time? Isn't this supposed to
> be distributed across 32 executors? Am I missing anything, and what can I
> do to improve this and save time?
>
> Thanks
>
> --


Thanks,
David S.


Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Xi Shen
It is a plain Java IO error. Your line is too long. You should alter your
JSON schema, so each line is a small JSON object.

Please do not concatenate all the objects into an array and then write the
array on a single line. You will have difficulty handling such a huge JSON
array in Spark anyway.

Because the whole array is a single object, it cannot be split across multiple
partitions.
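
For illustration, a rough sketch of the splittable layout, assuming a
DataFrame has already been built somehow; DataFrameWriter.json writes exactly
this one-object-per-line (JSON Lines) format, and the paths below are made up:

import sqlContext.implicits._

// tiny stand-in for the real data; the real DataFrame would come from the pipeline
val df = Seq(("v1", "2016-08-01"), ("v2", "2016-08-02")).toDF("visitor", "date")

// write JSON Lines (one object per line), which Spark can split across
// partitions, then read it back; both paths are illustrative
df.write.json("hdfs:///tmp/lvisitor-jsonl")
val reread = sqlContext.read.json("hdfs:///tmp/lvisitor-jsonl")
reread.printSchema()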


On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri 
wrote:

> Hello Community members,
>
> I am getting error while reading large JSON file in spark,
>
> *Code:*
>
> val landingVisitor =
> sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> *Error:*
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID
> 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at
> org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for this?
>
> Thanks in Advance !
>
>
> --
> Yours Aye,
> Chetan Khatri.
>
> --


Thanks,
David S.


Re: Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Okay, thank you.

On Mon, Oct 17, 2016 at 5:53 PM Sean Owen <so...@cloudera.com> wrote:

> You can take the "with user-provided Hadoop" binary from the download
> page, and yes that should mean it does not drag in a Hive dependency of its
> own.
>
> On Mon, Oct 17, 2016 at 7:08 AM Xi Shen <davidshe...@gmail.com> wrote:
>
> Hi,
>
> I want to configure my Hive to use Spark 2 as its engine. According to
> Hive's instructions, Spark should be built *without* Hadoop or Hive. I
> could build my own, but I would prefer to use an official
> binary build.
>
> So I want to ask if the official Spark binary build labeled "with
> user-provided Hadoop" also implies "user-provided Hive".
>
> --
>
>
> Thanks,
> David S.
>
> --


Thanks,
David S.


Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Xi Shen
I think most "big data" tools, like Spark and Hive, are not designed to edit
data; they are designed to query data. I wonder in what scenario you need to
update a large volume of data repeatedly.
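
For illustration, a minimal sketch of the "new DataFrame per update" pattern
Divya describes below; the table name, column names and update condition are
all hypothetical:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// one "update" pass: rows still marked pending get a new value and status;
// the result is a new DataFrame, the input is left untouched
def updateOnce(df: DataFrame): DataFrame =
  df.withColumn("value", when(col("status") === "pending", col("value") + 1).otherwise(col("value")))
    .withColumn("status", when(col("status") === "pending", lit("done")).otherwise(col("status")))

var current: DataFrame = sqlContext.table("frame_a")   // hypothetical source table
while (current.filter(col("status") === "pending").count() > 0) {
  current = updateOnce(current).cache()                // each iteration derives a new DataFrame
}

Note that the lineage grows with each iteration, so a long-running loop like
this usually needs an occasional checkpoint as well.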


On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
wrote:

> If my understanding of your question is correct: in Spark, DataFrames are
> immutable, so you cannot update a DataFrame in place.
> You have to create a new DataFrame to "update" the current one.
>
>
> Thanks,
> Divya
>
>
> On 17 October 2016 at 09:50, Mungeol Heo  wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while (if there was a updating) {
> update a data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
>
>
> --


Thanks,
David S.


Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Hi,

I want to configure my Hive to use Spark 2 as its engine. According to
Hive's instructions, Spark should be built *without* Hadoop or Hive. I
could build my own, but I would prefer to use an official
binary build.

So I want to ask if the official Spark binary build labeled "with
user-provided Hadoop" also implies "user-provided Hive".

-- 


Thanks,
David S.


Hi,

2016-08-21 Thread Xi Shen
I found several .conf files in the conf directory. Which one is used as the
default when I click the "new" button on the notebook homepage? I want to
edit the default profile configuration so that all my notebooks are created
with custom settings.

-- 


Thanks,
David S.


How to implement a InputDStream like the twitter stream in Spark?

2016-08-17 Thread Xi Shen
Hi,

First, I am not sure whether I should inherit from InputDStream or
ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a
receiver on each worker node?

If I want to inherit from InputDStream, what should I do in the compute() method?
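
For what it is worth, here is a minimal sketch of a receiver-less InputDStream;
all names are illustrative. compute() is called once per batch interval and
returns the RDD for that batch, while start()/stop() are hooks for opening and
closing whatever source is being polled. A ReceiverInputDStream is only needed
when data must be pushed into Spark continuously by long-running receivers on
the workers.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// a toy stream that emits the numbers 1..100 every batch; a real source would
// pull whatever arrived since the previous batch inside compute()
class NumbersInputDStream(ssc0: StreamingContext) extends InputDStream[Int](ssc0) {
  override def start(): Unit = {}   // open connections or clients here if needed
  override def stop(): Unit = {}    // release them here
  override def compute(validTime: Time): Option[RDD[Int]] =
    Some(ssc0.sparkContext.parallelize(1 to 100))
}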

-- 


Thanks,
David S.


Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Xi Shen
Hi Chitturi,

Please checkout
https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int
).

I think it is caused by the initialization step. The "kmeans||" method does
not initialize the dataset in parallel, so if your dataset is large, it takes a
long time to initialize. Just change it to "random".

Hope it helps.
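
For reference, a hedged sketch of switching the initialization mode with the
RDD-based API; the input path and the parameter values are made up:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// parse whitespace-separated doubles into dense vectors; the path is illustrative
val data = sc.textFile("hdfs:///path/to/vectors.txt")
  .filter(_.trim.nonEmpty)
  .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
  .cache()

val model = new KMeans()
  .setK(100)
  .setMaxIterations(100)
  .setInitializationMode(KMeans.RANDOM)   // default is KMeans.K_MEANS_PARALLEL ("k-means||")
  // .setInitializationSteps(2)           // or keep k-means|| and lower its steps
  .run(data)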


On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma 
wrote:

> Hi All,
>
>   I  am facing the same issue. taking k values from 60 to 120 incrementing
> by 10 each time i.e k takes value 60,70,80,...120 the algorithm takes
> around 2.5h on a 800 MB data set with 38 dimensions.
> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
> <[hidden email] >
> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I am have the same problem like you did.
>> I want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
>> cluster has 7 executors, each has 8 cores...
>>
>> If I set k=5000 which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
>
-- 

Regards,
David


Spark job concurrency problem

2015-05-04 Thread Xi Shen
Hi,

I have two small RDDs, each with about 600 records. In my code, I did:

val rdd1 = sc...cache()
val rdd2 = sc...cache()

val result = rdd1.cartesian(rdd2).repartition(num_cpu).map { case (a, b) =>
  some_expensive_job(a, b)
}

I ran my job on the YARN cluster with --master yarn-cluster. I have 6
executors, each with a large amount of memory.

However, I noticed my job is very slow. I went to the RM page and found
there are only two containers: one is the driver, one is the worker. I guess
this is correct?

I went to the worker's log and monitored it in detail. My app prints some
information, so I can use it to estimate the progress of the map
operation. Looking at the log, it feels like the tasks are done one by one
sequentially, rather than #cpu at a time.

I checked the worker nodes, and their CPUs are all busy.
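
For what it is worth, a hedged sketch of making the requested parallelism
explicit; the RDDs and the expensive function below are stand-ins, and the
numbers are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cartesian-job")
  .set("spark.executor.instances", "6")   // matches the 6 executors mentioned above
  .set("spark.executor.cores", "8")
val sc = new SparkContext(conf)

// stand-ins for the two small cached RDDs and the expensive per-pair job
val rdd1 = sc.parallelize(1 to 600).cache()
val rdd2 = sc.parallelize(1 to 600).cache()
def someExpensiveJob(a: Int, b: Int): Int = a * b

val numCpu = sc.defaultParallelism        // roughly executors * cores on YARN
val result = rdd1.cartesian(rdd2).repartition(numCpu).map { case (a, b) =>
  someExpensiveJob(a, b)
}
result.count()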



--
Xi Shen
http://about.me/davidshen


IOUtils cannot write anything in Spark?

2015-04-23 Thread Xi Shen
Hi,

I have an RDD of some processed data. I want to write these files to HDFS,
but not for future M/R processing; I just want to write plain old text
files. I tried:

rdd foreach { d =>
  val file = // create the file using a HDFS FileSystem
  val lines = d map {
    // format data into string
  }

  IOUtils.writeLines(lines, System.lineSeparator(), file)
}

Note, I was using the IOUtils from commons-io, not the one from the Hadoop
package.

The result is that all the files are created in my HDFS, but they contain no
data at all...
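
One hedged guess: commons-io buffers the output, so the HDFS stream has to be
flushed and closed inside the task, otherwise the files stay empty. A rough
sketch of the same idea with an explicit close; every name and path below is
a stand-in:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val rdd = sc.parallelize(Seq(Seq("a", "b"), Seq("c")))   // stand-in for the processed data

rdd.foreachPartition { iter =>
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"/tmp/out-${java.util.UUID.randomUUID()}.txt"))
  try {
    iter.foreach(d => d.foreach(line => out.write((line + "\n").getBytes("UTF-8"))))
  } finally {
    out.close()   // without closing (or at least flushing), nothing reaches HDFS
  }
}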



--
Xi Shen
http://about.me/davidshen


Re: kmeans|| in Spark is not really parallel?

2015-04-03 Thread Xi Shen
Hi Xingrui,

I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and
attached the sample code. But I could not attach the test data. I will
update the bug once I find a place to host the test data.


Thanks,
David


On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng men...@gmail.com wrote:

 This PR updated the k-means|| initialization:
 https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
 which was included in 1.3.0. It should fix kmean|| initialization with
 large k. Please create a JIRA for this issue and send me the code and the
 dataset to produce this problem. Thanks! -Xiangrui

 On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have opened a couple of threads asking about k-means performance
 problem in Spark. I think I made a little progress.

 Previous I use the simplest way of KMeans.train(rdd, k, maxIterations).
 It uses the kmeans|| initialization algorithm which supposedly to be a
 faster version of kmeans++ and give better results in general.

 But I observed that if the k is very large, the initialization step takes
 a long time. From the CPU utilization chart, it looks like only one thread
 is working. Please see
 https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
 .

 I read the paper,
 http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it
 points out kmeans++ initialization algorithm will suffer if k is large.
 That's why the paper contributed the kmeans|| algorithm.


 If I invoke KMeans.train by using the random initialization algorithm, I
 do not observe this problem, even with very large k, like k=5000. This
 makes me suspect that the kmeans|| in Spark is not properly implemented and
 do not utilize parallel implementation.


 I have also tested my code and data set with Spark 1.3.0, and I still
 observe this problem. I quickly checked the PR regarding the KMeans
 algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
 and polish, not changing/improving the algorithm.


 I originally worked on Windows 64bit environment, and I also tested on
 Linux 64bit environment. I could provide the code and data set if anyone
 want to reproduce this problem.


 I hope a Spark developer could comment on this problem and help
 identifying if it is a bug.


 Thanks,

 --
 Xi Shen
 http://about.me/davidshen





Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xi Shen
For the same amount of data, if I set k=500, the job finishes in about
3 hrs. I wonder whether, with k=5000, the job could finish in 30 hrs... the
longest I have waited was 12 hrs...

If I use random initialization with the same amount of data and k=5000, the
job finishes in less than 2 hrs.

I think the current kmeans|| implementation cannot handle large vector
dimensions properly. In my case, my vectors have about 350 dimensions. I
found another post complaining about k-means performance in Spark, and that
person had vectors of 200 dimensions.

It is possible that the large-dimension case was never tested.


Thanks,
David




On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote:

 Hi Xi,

 Please create a JIRA if it takes longer to locate the issue. Did you
 try a smaller k?

 Best,
 Xiangrui

 On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
  Hi Burak,
 
  After I added .repartition(sc.defaultParallelism), I can see from the
 log
  the partition number is set to 32. But in the Spark UI, it seems all the
  data are loaded onto one executor. Previously they were loaded onto 4
  executors.
 
  Any idea?
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote:
 
  How do I get the number of cores that I specified at the command line? I
  want to use spark.default.parallelism. I have 4 executors, each has 8
  cores. According to
  https://spark.apache.org/docs/1.2.0/configuration.html#
 execution-behavior,
  the spark.default.parallelism value will be 4 * 8 = 32...I think it
 is too
  large, or inappropriate. Please give some suggestion.
 
  I have already used cache, and count to pre-cache.
 
  I can try with smaller k for testing, but eventually I will have to use
 k
  = 5000 or even large. Because I estimate our data set would have that
 much
  of clusters.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
  The number of centroids (k=5000) seems too large and is probably the
  cause of the code taking too long.
 
  Can you please try the following:
  1) Repartition data to the number of available cores with
  .repartition(numCores)
  2) cache data
  3) call .count() on data right before k-means
  4) try k=500 (even less if possible)
 
  Thanks,
  Burak
 
  On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
  
   The code is very simple.
  
   val data = sc.textFile(very/large/text/file) map { l =
 // turn each line into dense vector
 Vectors.dense(...)
   }
  
   // the resulting data set is about 40k vectors
  
   KMeans.train(data, k=5000, maxIterations=500)
  
   I just kill my application. In the log I found this:
  
   15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
   block broadcast_26_piece0
   15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
   connection from
   workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.
 net/100.72.84.107:56277
   java.io.IOException: An existing connection was forcibly closed by
 the
   remote host
  
   Notice the time gap. I think it means the work node did not generate
   any log at all for about 12hrs...does it mean they are not working
 at all?
  
   But when testing with very small data set, my application works and
   output expected data.
  
  
   Thanks,
   David
  
  
   On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com
 wrote:
  
   Can you share the code snippet of how you call k-means? Do you cache
   the data before k-means? Did you repartition the data?
  
   On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
  
   OH, the job I talked about has ran more than 11 hrs without a
   result...it doesn't make sense.
  
  
   On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com
   wrote:
  
   Hi Burak,
  
   My iterations is set to 500. But I think it should also stop of
 the
   centroid coverages, right?
  
   My spark is 1.2.0, working in windows 64 bit. My data set is about
   40k vectors, each vector has about 300 features, all normalised.
 All work
   node have sufficient memory and disk space.
  
   Thanks,
   David
  
  
   On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
  
   Hi David,
  
   When the number of runs are large and the data is not properly
   partitioned, it seems that K-Means is hanging according to my
 experience.
   Especially setting the number of runs to something high
 drastically
   increases the work in executors. If that's not the case, can you
 give more
   info on what Spark version you are using, your setup, and your
 dataset?
  
   Thanks,
   Burak
  
   On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com
 wrote:
  
   Hi,
  
   When I run k-means cluster with Spark, I got this in the last
 two
   lines in the log:
  
   15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast
 26
   15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
  
  
  
   Then it hangs for a long time. There's

Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen
Hi Burak,

Unfortunately, I am expected to do my work in the HDInsight environment, which
only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply replace
it with Spark 1.3.

I think the problem I am observing is caused by the kmeans|| initialization
step. I will open another thread to discuss it.


Thanks,
David





--
Xi Shen
http://about.me/davidshen

On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 Can you also try with Spark 1.3 if possible? I believe there was a 2x
 improvement on K-Means between 1.2 and 1.3.

 Thanks,
 Burak



 On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 davidshe...@gmail.com
 wrote:

 Hi Jao,

 Sorry to pop up this old thread. I am have the same problem like you did.
 I
 want to know if you have figured out how to improve k-means on Spark.

 I am using Spark 1.2.0. My data set is about 270k vectors, each has about
 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster.
 The
 cluster has 7 executors, each has 8 cores...

 If I set k=5000 which is the required value for my task, the job goes on
 forever...


 Thanks,
 David




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






kmeans|| in Spark is not really parallel?

2015-03-29 Thread Xi Shen
Hi,

I have opened a couple of threads asking about k-means performance problem
in Spark. I think I made a little progress.

Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It
uses the kmeans|| initialization algorithm, which is supposed to be a
faster version of kmeans++ and to give better results in general.

But I observed that if the k is very large, the initialization step takes a
long time. From the CPU utilization chart, it looks like only one thread is
working. Please see
https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
.

I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf,
and it points out that the kmeans++ initialization algorithm suffers when k is
large. That is why the paper contributed the kmeans|| algorithm.


If I invoke KMeans.train with the random initialization algorithm, I do
not observe this problem, even with a very large k, like k=5000. This makes
me suspect that the kmeans|| in Spark is not properly implemented and does
not actually run in parallel.


I have also tested my code and data set with Spark 1.3.0, and I still
observe this problem. I quickly checked the PR regarding the KMeans
algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
and polish, not changing/improving the algorithm.


I originally worked in a Windows 64-bit environment, and I also tested in a
Linux 64-bit environment. I can provide the code and data set if anyone
wants to reproduce this problem.


I hope a Spark developer could comment on this problem and help identify
whether it is a bug.


Thanks,

--
Xi Shen
http://about.me/davidshen


Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
I have put more detail about my problem at
http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed

I would really appreciate it if you could take a look at this problem. I
have tried various settings and ways to load/partition my data, but I just
cannot get rid of that long pause.


Thanks,
David





--
Xi Shen
http://about.me/davidshen

On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen davidshe...@gmail.com wrote:

 Yes, I have done repartition.

 I tried to repartition to the number of cores in my cluster. Not helping...
 I tried to repartition to the number of centroids (k value). Not helping...


 On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley jos...@databricks.com
 wrote:

 Can you try specifying the number of partitions when you load the data to
 equal the number of executors?  If your ETL changes the number of
 partitions, you can also repartition before calling KMeans.


 On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have a large data set, and I expects to get 5000 clusters.

 I load the raw data, convert them into DenseVector; then I did
 repartition and cache; finally I give the RDD[Vector] to KMeans.train().

 Now the job is running, and data are loaded. But according to the Spark
 UI, all data are loaded onto one executor. I checked that executor, and its
 CPU workload is very low. I think it is using only 1 of the 8 cores. And
 all other 3 executors are at rest.

 Did I miss something? Is it possible to distribute the workload to all 4
 executors?


 Thanks,
 David





Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
My vector dimension is about 360. The data count is about 270k. My
driver has 2.9 GB of memory. I attach a screenshot of the current executor
status. I submitted this job with --master yarn-cluster. I have a total of 7
worker nodes; one of them acts as the driver. In the screenshot, you can see
all worker nodes have loaded some data, but the driver has not loaded
any data.

The funny thing is, when I log on to the driver and check its CPU and
memory status, I see one Java process using about 18% of the CPU and
about 1.6 GB of memory.

[image: Inline image 1]

On Sat, Mar 28, 2015 at 7:06 PM Reza Zadeh r...@databricks.com wrote:

 How many dimensions does your data have? The size of the k-means model is
 k * d, where d is the dimension of the data.

 Since you're using k=1000, if your data has dimension higher than say,
 10,000, you will have trouble, because k*d doubles have to fit in the
 driver.

 Reza

 On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen davidshe...@gmail.com wrote:

 I have put more detail of my problem at http://stackoverflow.com/
 questions/29295420/spark-kmeans-computation-cannot-be-distributed

 It is really appreciate if you can help me take a look at this problem. I
 have tried various settings and ways to load/partition my data, but I just
 cannot get rid that long pause.


 Thanks,
 David





 --
 Xi Shen
 http://about.me/davidshen

 On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen davidshe...@gmail.com wrote:

 Yes, I have done repartition.

 I tried to repartition to the number of cores in my cluster. Not
 helping...
 I tried to repartition to the number of centroids (k value). Not
 helping...


 On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley jos...@databricks.com
 wrote:

 Can you try specifying the number of partitions when you load the data
 to equal the number of executors?  If your ETL changes the number of
 partitions, you can also repartition before calling KMeans.


 On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have a large data set, and I expects to get 5000 clusters.

 I load the raw data, convert them into DenseVector; then I did
 repartition and cache; finally I give the RDD[Vector] to KMeans.train().

 Now the job is running, and data are loaded. But according to the
 Spark UI, all data are loaded onto one executor. I checked that executor,
 and its CPU workload is very low. I think it is using only 1 of the 8
 cores. And all other 3 executors are at rest.

 Did I miss something? Is it possible to distribute the workload to all
 4 executors?


 Thanks,
 David







Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
Yes, I have done repartition.

I tried to repartition to the number of cores in my cluster. Not helping...
I tried to repartition to the number of centroids (k value). Not helping...


On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley jos...@databricks.com
wrote:

 Can you try specifying the number of partitions when you load the data to
 equal the number of executors?  If your ETL changes the number of
 partitions, you can also repartition before calling KMeans.


 On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have a large data set, and I expects to get 5000 clusters.

 I load the raw data, convert them into DenseVector; then I did
 repartition and cache; finally I give the RDD[Vector] to KMeans.train().

 Now the job is running, and data are loaded. But according to the Spark
 UI, all data are loaded onto one executor. I checked that executor, and its
 CPU workload is very low. I think it is using only 1 of the 8 cores. And
 all other 3 executors are at rest.

 Did I miss something? Is it possible to distribute the workload to all 4
 executors?


 Thanks,
 David





Re: How to deploy binary dependencies to workers?

2015-03-26 Thread Xi Shen
OK, after various tests, I found the native library can be loaded when
running in yarn-cluster mode. But I still cannot find out why it won't load
when running in yarn-client mode...
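
For reference, a hedged sketch of the settings usually involved; the file
names and paths are illustrative, and whether this actually fixes the
yarn-client case is untested here:

import org.apache.spark.SparkConf

// ship the native libraries with the app and point the JVMs at them; "." is
// the YARN container working directory, where distributed files are localized
val conf = new SparkConf()
  .set("spark.yarn.dist.files",
       "file:/c:/openblas/libblas3.dll,file:/c:/openblas/liblapack3.dll")
  .set("spark.executor.extraLibraryPath", ".")
  .set("spark.driver.extraLibraryPath", "c:/openblas")  // yarn-client driver runs locally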


Thanks,
David


On Thu, Mar 26, 2015 at 4:21 PM Xi Shen davidshe...@gmail.com wrote:

 Not of course...all machines in HDInsight are Windows 64bit server. And I
 have made sure all my DLLs are for 64bit machines. I have managed to get
 those DLLs loade on my local machine which is also Windows 64bit.




  --
  Xi Shen
  http://about.me/davidshen

 On Thu, Mar 26, 2015 at 11:11 AM, DB Tsai dbt...@dbtsai.com wrote:

 Are you deploying the windows dll to linux machine?

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen davidshe...@gmail.com wrote:
  I think you meant to use the --files to deploy the DLLs. I gave a
 try, but
  it did not work.
 
  From the Spark UI, Environment tab, I can see
 
  spark.yarn.dist.files
 
 
 file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
 
  I think my DLLs are all deployed. But I still got the warn message that
  native BLAS library cannot be load.
 
  And idea?
 
 
  Thanks,
  David
 
 
  On Wed, Mar 25, 2015 at 5:40 AM DB Tsai dbt...@dbtsai.com wrote:
 
  I would recommend to upload those jars to HDFS, and use add jars
  option in spark-submit with URI from HDFS instead of URI from local
  filesystem. Thus, it can avoid the problem of fetching jars from
  driver which can be a bottleneck.
 
  Sincerely,
 
  DB Tsai
  ---
  Blog: https://www.dbtsai.com
 
 
  On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com
 wrote:
   Hi,
  
   I am doing ML using Spark mllib. However, I do not have full control
 to
   the
   cluster. I am using Microsoft Azure HDInsight
  
   I want to deploy the BLAS or whatever required dependencies to
   accelerate
   the computation. But I don't know how to deploy those DLLs when I
 submit
   my
   JAR to the cluster.
  
   I know how to pack those DLLs into a jar. The real challenge is how
 to
   let
   the system find them...
  
  
   Thanks,
   David
  





Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi,

When I run k-means cluster with Spark, I got this in the last two lines in
the log:

15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



Then it hangs for a long time. There is no active job, and the driver machine
is idle. I cannot access the worker nodes, so I am not sure if they are busy.

I understand k-means may take a long time to finish. But why no active job?
no log?


Thanks,
David


SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
Hi,

I want to load my data in this way:

sc.wholeTextFiles(opt.input) map { x => (x._1,
x._2.lines.filter(!_.isEmpty).toSeq) }


But I got

java.io.NotSerializableException: scala.collection.Iterator$$anon$13

But if I use x._2.split('\n'), I can get the expected result. I want to
know what's wrong with using the lines() function.


Thanks,

--
Xi Shen
http://about.me/davidshen


k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi,

I have a large data set, and I expect to get 5000 clusters.

I load the raw data, convert them into DenseVector; then I did repartition
and cache; finally I give the RDD[Vector] to KMeans.train().

Now the job is running and the data is loaded. But according to the Spark UI,
all the data has been loaded onto one executor. I checked that executor: its
CPU workload is very low, and I think it is using only 1 of its 8 cores. All
the other 3 executors are idle.

Did I miss something? Is it possible to distribute the workload to all 4
executors?


Thanks,
David


Re: SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
I have to use .lines.toArray.toSeq

A little tricky.
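
Spelled out, a minimal sketch of the working variant; the input path is made up:

// materialize the line iterator before it leaves the closure; a bare
// .lines.toSeq keeps a lazy, non-serializable Iterator underneath
val docs = sc.wholeTextFiles("hdfs:///path/to/input").map { x =>
  (x._1, x._2.lines.filter(!_.isEmpty).toArray.toSeq)
}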




--
Xi Shen
http://about.me/davidshen

On Fri, Mar 27, 2015 at 4:41 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I want to load my data in this way:

 sc.wholeTextFiles(opt.input) map { x = (x._1,
 x._2.lines.filter(!_.isEmpty).toSeq) }


 But I got

 java.io.NotSerializableException: scala.collection.Iterator$$anon$13

 But if I use x._2.split('\n'), I can get the expected result. I want to
 know what's wrong with using the lines() function.


 Thanks,

  --
  Xi Shen
  http://about.me/davidshen



Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak,

After I added .repartition(sc.defaultParallelism), I can see from the log
the partition number is set to 32. But in the Spark UI, it seems all the
data are loaded onto one executor. Previously they were loaded onto 4
executors.

Any idea?


Thanks,
David


On Fri, Mar 27, 2015 at 11:01 AM Xi Shen davidshe...@gmail.com wrote:

 How do I get the number of cores that I specified at the command line? I
 want to use spark.default.parallelism. I have 4 executors, each has 8
 cores. According to
 https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
 the spark.default.parallelism value will be 4 * 8 = 32...I think it is
 too large, or inappropriate. Please give some suggestion.

 I have already used cache, and count to pre-cache.

 I can try with smaller k for testing, but eventually I will have to use k
 = 5000 or even large. Because I estimate our data set would have that much
 of clusters.


 Thanks,
 David


 On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:

 Hi David,
 The number of centroids (k=5000) seems too large and is probably the
 cause of the code taking too long.

 Can you please try the following:
 1) Repartition data to the number of available cores with
 .repartition(numCores)
 2) cache data
 3) call .count() on data right before k-means
 4) try k=500 (even less if possible)

 Thanks,
 Burak

 On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
 
  The code is very simple.
 
  val data = sc.textFile(very/large/text/file) map { l =
// turn each line into dense vector
Vectors.dense(...)
  }
 
  // the resulting data set is about 40k vectors
 
  KMeans.train(data, k=5000, maxIterations=500)
 
  I just kill my application. In the log I found this:
 
  15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
 block broadcast_26_piece0
  15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
 connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.
 net/100.72.84.107:56277
  java.io.IOException: An existing connection was forcibly closed by the
 remote host
 
  Notice the time gap. I think it means the work node did not generate
 any log at all for about 12hrs...does it mean they are not working at all?
 
  But when testing with very small data set, my application works and
 output expected data.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:
 
  Can you share the code snippet of how you call k-means? Do you cache
 the data before k-means? Did you repartition the data?
 
  On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
 
  OH, the job I talked about has ran more than 11 hrs without a
 result...it doesn't make sense.
 
 
  On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com
 wrote:
 
  Hi Burak,
 
  My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?
 
  My spark is 1.2.0, working in windows 64 bit. My data set is about
 40k vectors, each vector has about 300 features, all normalised. All work
 node have sufficient memory and disk space.
 
  Thanks,
  David
 
 
  On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
 
  When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?
 
  Thanks,
  Burak
 
  On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:
 
  Hi,
 
  When I run k-means cluster with Spark, I got this in the last two
 lines in the log:
 
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
 
 
 
  Then it hangs for a long time. There's no active job. The driver
 machine is idle. I cannot access the work node, I am not sure if they are
 busy.
 
  I understand k-means may take a long time to finish. But why no
 active job? no log?
 
 
  Thanks,
  David
 




Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Xi Shen
It is brought in by another dependency, so you do not need to specify it
explicitly... I think that is what Ted means.

On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia mchett...@rocketfuelinc.com
wrote:

 +spark-dev

 Yes, the dependencies are there. I guess my question is how come the build
 is succeeding in the mainline then, without adding these dependencies?

 On Thu, Mar 26, 2015 at 3:44 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looking at output from dependency:tree, servlet-api is brought in by the
 following:

 [INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
 [INFO] |  +- org.antlr:antlr:jar:3.2:compile
 [INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
 [INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
 [INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
 [INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
 [INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
 [INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile

 FYI

 On Thu, Mar 26, 2015 at 3:36 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote:

 Hi,

 We are trying to build spark 1.2 from source (tip of the branch-1.2 at
 the moment). I tried to build spark using the following command:

 mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
 -Phive-thriftserver -DskipTests clean package

 I encountered various missing class definition exceptions (e.g: class
 javax.servlet.ServletException not found).

 I eventually got the build to succeed after adding the following set of
 dependencies to the spark-core's pom.xml:

 <dependency>
   <groupId>javax.servlet</groupId>
   <artifactId>servlet-api</artifactId>
   <version>3.0</version>
 </dependency>

 <dependency>
   <groupId>org.eclipse.jetty</groupId>
   <artifactId>jetty-io</artifactId>
 </dependency>

 <dependency>
   <groupId>org.eclipse.jetty</groupId>
   <artifactId>jetty-http</artifactId>
 </dependency>

 <dependency>
   <groupId>org.eclipse.jetty</groupId>
   <artifactId>jetty-servlet</artifactId>
 </dependency>

 Pretty much all of the missing class definition errors came up while
 building HttpServer.scala, and went away after the above dependencies were
 included.

 My guess is official build for spark 1.2 is working already. My question
 is what is wrong with my environment or setup, that requires me to add
 dependencies to pom.xml in this manner, to get this build to succeed.

 Also, i am not sure if this build would work at runtime for us, i am
 still testing this out.


 Thanks,
 pala






Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
How do I get the number of cores that I specified at the command line? I
want to use spark.default.parallelism. I have 4 executors, each with 8
cores. According to
https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
the spark.default.parallelism value will be 4 * 8 = 32... I think that may be
too large, or inappropriate. Please give some suggestions.

I have already used cache, and count to pre-cache.

I can try a smaller k for testing, but eventually I will have to use k =
5000 or even larger, because I estimate our data set has that many
clusters.
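
For what it is worth, a small sketch of reading the derived value at runtime
instead of computing it by hand; data stands in for the RDD[Vector] being
clustered:

// sc.defaultParallelism reflects what Spark derived from the submitted
// executors and cores (on YARN, roughly executors * cores per executor)
val numParts = sc.defaultParallelism
println(s"defaultParallelism = $numParts")

val prepared = data.repartition(numParts).cache()
prepared.count()   // force the cache before KMeans.train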


Thanks,
David


On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz brk...@gmail.com wrote:

 Hi David,
 The number of centroids (k=5000) seems too large and is probably the cause
 of the code taking too long.

 Can you please try the following:
 1) Repartition data to the number of available cores with
 .repartition(numCores)
 2) cache data
 3) call .count() on data right before k-means
 4) try k=500 (even less if possible)

 Thanks,
 Burak

 On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote:
 
  The code is very simple.
 
  val data = sc.textFile(very/large/text/file) map { l =
// turn each line into dense vector
Vectors.dense(...)
  }
 
  // the resulting data set is about 40k vectors
 
  KMeans.train(data, k=5000, maxIterations=500)
 
  I just kill my application. In the log I found this:
 
  15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_26_piece0
  15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
 connection from
 workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
  java.io.IOException: An existing connection was forcibly closed by the
 remote host
 
  Notice the time gap. I think it means the work node did not generate any
 log at all for about 12hrs...does it mean they are not working at all?
 
  But when testing with very small data set, my application works and
 output expected data.
 
 
  Thanks,
  David
 
 
  On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:
 
  Can you share the code snippet of how you call k-means? Do you cache
 the data before k-means? Did you repartition the data?
 
  On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:
 
  OH, the job I talked about has ran more than 11 hrs without a
 result...it doesn't make sense.
 
 
  On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:
 
  Hi Burak,
 
  My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?
 
  My spark is 1.2.0, working in windows 64 bit. My data set is about
 40k vectors, each vector has about 300 features, all normalised. All work
 node have sufficient memory and disk space.
 
  Thanks,
  David
 
 
  On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:
 
  Hi David,
 
  When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?
 
  Thanks,
  Burak
 
  On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:
 
  Hi,
 
  When I run k-means cluster with Spark, I got this in the last two
 lines in the log:
 
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
  15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
 
 
 
  Then it hangs for a long time. There's no active job. The driver
 machine is idle. I cannot access the work node, I am not sure if they are
 busy.
 
  I understand k-means may take a long time to finish. But why no
 active job? no log?
 
 
  Thanks,
  David
 



Re: K Means cluster with spark

2015-03-26 Thread Xi Shen
Hi Sandeep,

I followed the DenseKMeans example that comes with the Spark package.

I have about 40k vectors in total, and my k=500. All my code is written in
Scala.

Thanks,
David

On Fri, 27 Mar 2015 05:51 sandeep vura sandeepv...@gmail.com wrote:

 Hi Shen,

 I am also working on k-means clustering with Spark. May I know which links
 you are following to understand k-means clustering with Spark? I also need a
 sample k-means program, written in Scala, to run in Spark.

 Regards,
 Sandeep.v



Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak,

My maxIterations is set to 500, but I think it should also stop when the
centroids converge, right?

My Spark is 1.2.0, running on Windows 64-bit. My data set is about 40k
vectors; each vector has about 300 features, all normalised. All worker nodes
have sufficient memory and disk space.

Thanks,
David

On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?

 Thanks,
 Burak
 On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 When I run k-means cluster with Spark, I got this in the last two lines
 in the log:

 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



 Then it hangs for a long time. There's no active job. The driver machine
 is idle. I cannot access the work node, I am not sure if they are busy.

 I understand k-means may take a long time to finish. But why no active
 job? no log?


 Thanks,
 David




Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Oh, the job I talked about has run for more than 11 hrs without a result... it
doesn't make sense.


On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:

 Hi Burak,

 My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?

 My spark is 1.2.0, working in windows 64 bit. My data set is about 40k
 vectors, each vector has about 300 features, all normalised. All work node
 have sufficient memory and disk space.

 Thanks,
 David

 On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?

 Thanks,
 Burak
 On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 When I run k-means cluster with Spark, I got this in the last two lines
 in the log:

 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



 Then it hangs for a long time. There's no active job. The driver machine
 is idle. I cannot access the work node, I am not sure if they are busy.

 I understand k-means may take a long time to finish. But why no active
 job? no log?


 Thanks,
 David




Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
The code is very simple.

val data = sc.textFile("very/large/text/file") map { l =>
  // turn each line into dense vector
  Vectors.dense(...)
}

// the resulting data set is about 40k vectors

KMeans.train(data, k=5000, maxIterations=500)

I just kill my application. In the log I found this:

15/03/26 *11:42:43* INFO storage.BlockManagerMaster: Updated info of block
broadcast_26_piece0
15/03/26 *23:02:57* WARN server.TransportChannelHandler: Exception in
connection from
workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
java.io.IOException: An existing connection was forcibly closed by the
remote host

Notice the time gap. I think it means the worker node did not generate any
log at all for about 12 hrs... does that mean it was not working at all?

But when testing with a very small data set, my application works and outputs
the expected data.


Thanks,
David


On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote:

 Can you share the code snippet of how you call k-means? Do you cache the
 data before k-means? Did you repartition the data?
 On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote:

 OH, the job I talked about has ran more than 11 hrs without a result...it
 doesn't make sense.


 On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:

 Hi Burak,

 My iterations is set to 500. But I think it should also stop of the
 centroid coverages, right?

 My spark is 1.2.0, working in windows 64 bit. My data set is about 40k
 vectors, each vector has about 300 features, all normalised. All work node
 have sufficient memory and disk space.

 Thanks,
 David

 On Fri, 27 Mar 2015 02:48 Burak Yavuz brk...@gmail.com wrote:

 Hi David,

 When the number of runs are large and the data is not properly
 partitioned, it seems that K-Means is hanging according to my experience.
 Especially setting the number of runs to something high drastically
 increases the work in executors. If that's not the case, can you give more
 info on what Spark version you are using, your setup, and your dataset?

 Thanks,
 Burak
 On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 When I run k-means cluster with Spark, I got this in the last two
 lines in the log:

 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5



 Then it hangs for a long time. There's no active job. The driver
 machine is idle. I cannot access the work node, I am not sure if they are
 busy.

 I understand k-means may take a long time to finish. But why no active
 job? no log?


 Thanks,
 David




Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-26 Thread Xi Shen
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8
cores...the magic number in the bug.




Xi Shen
http://about.me/davidshen

On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 What's your Spark version? Not quite sure, but you could be hitting this
 issue
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4516
 On 26 Mar 2015 11:01, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 My environment is Windows 64-bit, Spark + YARN. I had a job that takes a
 long time. It started well, but it ended with the exception below:

 15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in
 connection from
 headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:58507
 java.io.IOException: An existing connection was forcibly closed by the
 remote host
 at sun.nio.ch.SocketDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
 at sun.nio.ch.IOUtil.read(IOUtil.java:192)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
 at
 io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
 at
 io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
 at
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
 at
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 at java.lang.Thread.run(Thread.java:745)
 15/03/25 12:39:09 ERROR executor.CoarseGrainedExecutorBackend: Driver
 Disassociated [akka.tcp://
 sparkexecu...@workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:65469]
 - [akka.tcp://
 sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
 disassociated! Shutting down.
 15/03/25 12:39:09 WARN remote.ReliableDeliverySupervisor: Association
 with remote system [akka.tcp://
 sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
 has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

 Interestingly, the job is shown as Succeeded in the RM. I checked the
 application log; it is miles long, and this is the only exception I found.
 And it is not very useful for helping me pinpoint the problem.

 Any idea what would be the cause?


 Thanks,


 Xi Shen
 http://about.me/davidshen




Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
Of course not...all machines in HDInsight are Windows 64-bit servers. And I
have made sure all my DLLs are for 64-bit machines. I have managed to get
those DLLs loaded on my local machine, which is also Windows 64-bit.




Xi Shen
http://about.me/davidshen

On Thu, Mar 26, 2015 at 11:11 AM, DB Tsai dbt...@dbtsai.com wrote:

 Are you deploying the windows dll to linux machine?

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen davidshe...@gmail.com wrote:
  I think you meant to use --files to deploy the DLLs. I gave it a try, but
  it did not work.
 
  From the Spark UI, Environment tab, I can see
 
  spark.yarn.dist.files
 
 
 file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
 
  I think my DLLs are all deployed. But I still got the warning that the
  native BLAS library cannot be loaded.
 
  Any idea?
 
 
  Thanks,
  David
 
 
  On Wed, Mar 25, 2015 at 5:40 AM DB Tsai dbt...@dbtsai.com wrote:
 
  I would recommend uploading those jars to HDFS, and using the add-jars
  option in spark-submit with a URI from HDFS instead of a URI from the local
  filesystem. This avoids the problem of fetching jars from the
  driver, which can be a bottleneck.
 
  Sincerely,
 
  DB Tsai
  ---
  Blog: https://www.dbtsai.com
 
 
  On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote:
   Hi,
  
   I am doing ML using Spark mllib. However, I do not have full control
 to
   the
   cluster. I am using Microsoft Azure HDInsight
  
   I want to deploy the BLAS or whatever required dependencies to
   accelerate
   the computation. But I don't know how to deploy those DLLs when I
 submit
   my
   JAR to the cluster.
  
   I know how to pack those DLLs into a jar. The real challenge is how to
   let
   the system find them...
  
  
   Thanks,
   David
  



How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Xi Shen
Hi,

My environment is Windows 64-bit, Spark + YARN. I had a job that takes a
long time. It started well, but it ended with the exception below:

15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in
connection from
headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:58507
java.io.IOException: An existing connection was forcibly closed by the
remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/03/25 12:39:09 ERROR executor.CoarseGrainedExecutorBackend: Driver
Disassociated [akka.tcp://
sparkexecu...@workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:65469]
- [akka.tcp://
sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
disassociated! Shutting down.
15/03/25 12:39:09 WARN remote.ReliableDeliverySupervisor: Association with
remote system [akka.tcp://
sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Interestingly, the job is shown as Succeeded in the RM. I checked the
application log; it is miles long, and this is the only exception I found.
And it is not very useful for helping me pinpoint the problem.

Any idea what would be the cause?


Thanks,


Xi Shen
http://about.me/davidshen


Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Xi Shen
What is your environment? I remember I had a similar error when running
spark-shell --master yarn-client in a Windows environment.


On Wed, Mar 25, 2015 at 9:07 PM sachin Singh sachin.sha...@gmail.com
wrote:

 Hi,
 when I am submitting a Spark job in cluster mode I get the error below in
 the hadoop-yarn log.
 Does someone have any idea? Please suggest.

 2015-03-25 23:35:22,467 INFO
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
 application_1427124496008_0028 State change from FINAL_SAVING to FAILED
 2015-03-25 23:35:22,467 WARN
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs
 OPERATION=Application Finished - Failed TARGET=RMAppManager
  RESULT=FAILURE
 DESCRIPTION=App failed with state: FAILED   PERMISSIONS=Application
 application_1427124496008_0028 failed 2 times due to AM Container for
 appattempt_1427124496008_0028_02 exited with  exitCode: 13 due to:
 Exception from container-launch.
 Container id: container_1427124496008_0028_02_01
 Exit code: 13
 Stack trace: ExitCodeException exitCode=13:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
 at org.apache.hadoop.util.Shell.run(Shell.java:455)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.
 launchContainer(DefaultContainerExecutor.java:197)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.
 launcher.ContainerLaunch.call(ContainerLaunch.java:299)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.
 launcher.ContainerLaunch.call(ContainerLaunch.java:81)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(
 ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(
 ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Container exited with a non-zero exit code 13
 .Failing this attempt.. Failing the application.
 APPID=application_1427124496008_0028



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as-
 master-yarn-cluster-tp0.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
I think you meant to use --files to deploy the DLLs. I gave it a try,
but it did not work.

From the Spark UI, Environment tab, I can see

spark.yarn.dist.files

file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll

I think my DLLs are all deployed. But I still got the warning that the
native BLAS library cannot be loaded.

Any idea?


Thanks,
David


On Wed, Mar 25, 2015 at 5:40 AM DB Tsai dbt...@dbtsai.com wrote:

 I would recommend uploading those jars to HDFS, and using the add-jars
 option in spark-submit with a URI from HDFS instead of a URI from the local
 filesystem. This avoids the problem of fetching jars from the
 driver, which can be a bottleneck.

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote:
  Hi,
 
  I am doing ML using Spark mllib. However, I do not have full control to
 the
  cluster. I am using Microsoft Azure HDInsight
 
  I want to deploy the BLAS or whatever required dependencies to accelerate
  the computation. But I don't know how to deploy those DLLs when I submit
 my
  JAR to the cluster.
 
  I know how to pack those DLLs into a jar. The real challenge is how to
 let
  the system find them...
 
 
  Thanks,
  David
 



How to deploy binary dependencies to workers?

2015-03-24 Thread Xi Shen
Hi,

I am doing ML using Spark MLlib. However, I do not have full control of the
cluster. I am using Microsoft Azure HDInsight.

I want to deploy the BLAS or whatever required dependencies to accelerate
the computation. But I don't know how to deploy those DLLs when I submit my
JAR to the cluster.

I know how to pack those DLLs into a jar. The real challenge is how to let
the system find them...


Thanks,
David


Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-23 Thread Xi Shen
I did not build my own Spark; I got the binary version online. If it can
load the native libs from the IDE, it should also be able to load them when
running with --master local.

On Mon, 23 Mar 2015 07:15 Burak Yavuz brk...@gmail.com wrote:

 Did you build Spark with: -Pnetlib-lgpl?

 Ref: https://spark.apache.org/docs/latest/mllib-guide.html

 Burak

 On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu yuzhih...@gmail.com wrote:

 How about pointing LD_LIBRARY_PATH to native lib folder ?

 You need Spark 1.2.0 or higher for the above to work. See SPARK-1719

 Cheers

 On Sun, Mar 22, 2015 at 4:02 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi Ted,

 I have tried to invoke the command from both the Cygwin and
 PowerShell environments. I still get these messages:

 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS

 From the Spark UI, I can see:

   spark.driver.extraLibrary c:\openblas


 Thanks,
 David


 On Sun, Mar 22, 2015 at 11:45 AM Ted Yu yuzhih...@gmail.com wrote:

 Can you try the --driver-library-path option ?

 spark-submit --driver-library-path /opt/hadoop/lib/native ...

 Cheers

 On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I use the *OpenBLAS* DLL, and have configured my application to work
 in IDE. When I start my Spark application from IntelliJ IDE, I can see in
 the log that the native lib is loaded successfully.

 But if I use *spark-submit* to start my application, the native lib
 still cannot be loaded. I saw the WARN message that it failed to load both
 the native and native-ref library. I checked the *Environment* tab in
 the Spark UI, and the *java.library.path* is set correctly.


 Thanks,

 David








Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Xi Shen
Hi Ted,

I have tried to invoke the command from both cygwin environment and
powershell environment. I still get the messages:

15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
15/03/22 21:56:00 WARN netlib.BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS

From the Spark UI, I can see:

  spark.driver.extraLibrary c:\openblas
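
One thing worth double-checking: the property Spark actually reads is spark.driver.extraLibraryPath (and spark.executor.extraLibraryPath on the executor side), so if the Environment tab really shows spark.driver.extraLibrary, the property name itself may be the problem. A minimal sketch, assuming the c:\openblas path from this thread; note that the driver's library path has to be known before its JVM starts, so for the driver it belongs on the command line (Ted's --driver-library-path) or in spark-defaults.conf rather than in code:

import org.apache.spark.SparkConf

// executor-side native library path can be set from code; the driver-side path cannot,
// because the driver JVM is already running when this line executes
val conf = new SparkConf()
  .set("spark.executor.extraLibraryPath", "c:\\openblas")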


Thanks,
David


On Sun, Mar 22, 2015 at 11:45 AM Ted Yu yuzhih...@gmail.com wrote:

 Can you try the --driver-library-path option ?

 spark-submit --driver-library-path /opt/hadoop/lib/native ...

 Cheers

 On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I use the *OpenBLAS* DLL, and have configured my application to work in
 IDE. When I start my Spark application from IntelliJ IDE, I can see in the
 log that the native lib is loaded successfully.

 But if I use *spark-submit* to start my application, the native lib
 still cannot be loaded. I saw the WARN message that it failed to load both
 the native and native-ref library. I checked the *Environment* tab in
 the Spark UI, and the *java.library.path* is set correctly.


 Thanks,

 David






Re: How to set Spark executor memory?

2015-03-22 Thread Xi Shen
OK, I actually got the answer days ago from StackOverflow, but I did not
check it :(

When running in local mode, to set the executor memory

- when using spark-submit, use --driver-memory
- when running as a Java application, e.g. executing from the IDE, set the
-Xmx VM option


Thanks,
David


On Sun, Mar 22, 2015 at 2:10 PM Ted Yu yuzhih...@gmail.com wrote:

 bq. the BLAS native cannot be loaded

 Have you tried specifying --driver-library-path option ?

 Cheers

 On Sat, Mar 21, 2015 at 4:42 PM, Xi Shen davidshe...@gmail.com wrote:

 Yeah, I think it is harder to troubleshoot the properties issues in an IDE.
 But the reason I stick to the IDE is that if I use spark-submit, the native
 BLAS cannot be loaded. Maybe I should open another thread to discuss
 that.

 Thanks,
 David

 On Sun, 22 Mar 2015 10:38 Xi Shen davidshe...@gmail.com wrote:

 In the log, I saw

   MemoryStorage: MemoryStore started with capacity 6.7GB

 But I still can not find where to set this storage capacity.

 On Sat, 21 Mar 2015 20:30 Xi Shen davidshe...@gmail.com wrote:

 Hi Sean,

 It's getting strange now. If I ran from IDE, my executor memory is
 always set to 6.7G, no matter what value I set in code. I have check my
 environment variable, and there's no value of 6.7, or 12.5

 Any idea?

 Thanks,
 David

 On Tue, 17 Mar 2015 00:35 null jishnu.prat...@wipro.com wrote:

  Hi Xi Shen,

 You could set spark.executor.memory in the code itself: new
 SparkConf().set("spark.executor.memory", "2g")

 Or you can try the -- spark.executor.memory 2g while submitting the
 jar.



 Regards

 Jishnu Prathap



 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Monday, March 16, 2015 2:06 PM
 *To:* Xi Shen
 *Cc:* user@spark.apache.org
 *Subject:* Re: How to set Spark executor memory?



 By default spark.executor.memory is set to 512m, I'm assuming since
 you are submiting the job using spark-submit and it is not able to 
 override
 the value since you are running in local mode. Can you try it without 
 using
 spark-submit as a standalone project?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com
 wrote:

 I set it in code, not by configuration. I submit my jar file to local.
 I am working in my developer environment.



 On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How are you setting it? and how are you submitting the job?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com
 wrote:

 Hi,



 I have set spark.executor.memory to 2048m, and in the UI Environment
 page, I can see this value has been set correctly. But in the Executors
 page, I saw there's only 1 executor and its memory is 265.4MB. Very 
 strange
 value. why not 256MB, or just as what I set?



 What am I missing here?





 Thanks,

 David











Re: How to do nested foreach with RDD

2015-03-22 Thread Xi Shen
Hi Reza,

Yes, I just found RDD.cartesian(). Very useful.

Thanks,
David


On Sun, Mar 22, 2015 at 5:08 PM Reza Zadeh r...@databricks.com wrote:

 You can do this with the 'cartesian' product method on RDD. For example:

 val rdd1 = ...
 val rdd2 = ...

 val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a < b }
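
Spelled out as a tiny self-contained run of the same idea (toy data, hypothetical; sc is assumed to be the SparkContext, and whatever pairwise math you need goes inside the map):

val rdd1 = sc.parallelize(Seq(1.0, 2.0, 3.0))
val rdd2 = sc.parallelize(Seq(10.0, 20.0))
// every (a, b) combination, without nesting one RDD inside another
val products = rdd1.cartesian(rdd2).map { case (a, b) => a * b }
products.collect().foreach(println)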

 Reza

 On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have two big RDDs, and I need to do some math against each pair of them.
 Traditionally, this is like a nested for-loop. But with RDDs, it causes a nested
 RDD, which is prohibited.

 Currently, I am collecting one of them and then doing a nested for-loop, to
 avoid the nested RDD. But I would like to know if there's a Spark way to do this.


 Thanks,
 David





Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
Hi Sean,

It's getting strange now. If I run from the IDE, my executor memory is always
set to 6.7G, no matter what value I set in code. I have checked my
environment variables, and there's no value of 6.7 or 12.5.

Any idea?

Thanks,
David

On Tue, 17 Mar 2015 00:35 null jishnu.prat...@wipro.com wrote:

  Hi Xi Shen,

 You could set spark.executor.memory in the code itself: new
 SparkConf().set("spark.executor.memory", "2g")

 Or you can try the -- spark.executor.memory 2g while submitting the jar.



 Regards

 Jishnu Prathap



 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Monday, March 16, 2015 2:06 PM
 *To:* Xi Shen
 *Cc:* user@spark.apache.org
 *Subject:* Re: How to set Spark executor memory?



 By default spark.executor.memory is set to 512m, I'm assuming since you
 are submiting the job using spark-submit and it is not able to override the
 value since you are running in local mode. Can you try it without using
 spark-submit as a standalone project?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote:

 I set it in code, not by configuration. I submit my jar file to local. I
 am working in my developer environment.



 On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you setting it? and how are you submitting the job?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,



 I have set spark.executor.memory to 2048m, and in the UI Environment
 page, I can see this value has been set correctly. But in the Executors
 page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
 value. why not 256MB, or just as what I set?



 What am I missing here?





 Thanks,

 David









Re: Can I start multiple executors in local mode?

2015-03-21 Thread Xi Shen
No, I didn't mean local-cluster. I meant running locally, like in the IDE.

On Mon, 16 Mar 2015 23:12 xu Peng hsxup...@gmail.com wrote:

 Hi David,

 You can try the local-cluster.

 the numbers in local-cluster[2,2,1024] mean 2 workers,
 2 cores per worker, and 1024 MB per worker

 Best Regards

 Peng Xu

 2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com:

 Hi,

 In YARN mode you can specify the number of executors. I wonder if we can
 also start multiple executors at local, just to make the test run faster.

 Thanks,
 David





Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
In the log, I saw

  MemoryStorage: MemoryStore started with capacity 6.7GB

But I still can not find where to set this storage capacity.

On Sat, 21 Mar 2015 20:30 Xi Shen davidshe...@gmail.com wrote:

 Hi Sean,

 It's getting strange now. If I ran from IDE, my executor memory is always
 set to 6.7G, no matter what value I set in code. I have check my
 environment variable, and there's no value of 6.7, or 12.5

 Any idea?

 Thanks,
 David

 On Tue, 17 Mar 2015 00:35 null jishnu.prat...@wipro.com wrote:

  Hi Xi Shen,

 You could set spark.executor.memory in the code itself: new
 SparkConf().set("spark.executor.memory", "2g")

 Or you can try the -- spark.executor.memory 2g while submitting the jar.



 Regards

 Jishnu Prathap



 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Monday, March 16, 2015 2:06 PM
 *To:* Xi Shen
 *Cc:* user@spark.apache.org
 *Subject:* Re: How to set Spark executor memory?



 By default spark.executor.memory is set to 512m, I'm assuming since you
 are submiting the job using spark-submit and it is not able to override the
 value since you are running in local mode. Can you try it without using
 spark-submit as a standalone project?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote:

 I set it in code, not by configuration. I submit my jar file to local. I
 am working in my developer environment.



 On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you setting it? and how are you submitting the job?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,



 I have set spark.executor.memory to 2048m, and in the UI Environment
 page, I can see this value has been set correctly. But in the Executors
 page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
 value. why not 256MB, or just as what I set?



 What am I missing here?





 Thanks,

 David










netlib-java cannot load native lib in Windows when using spark-submit

2015-03-21 Thread Xi Shen
Hi,

I use the *OpenBLAS* DLL, and have configured my application to work in
IDE. When I start my Spark application from IntelliJ IDE, I can see in the
log that the native lib is loaded successfully.

But if I use *spark-submit* to start my application, the native lib still
cannot be loaded. I saw the WARN message that it failed to load both the
native and native-ref library. I checked the *Environment* tab in the Spark
UI, and the *java.library.path* is set correctly.


Thanks,

David


Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
Yeah, I think it is harder to troubleshoot the properties issues in an IDE.
But the reason I stick to the IDE is that if I use spark-submit, the native
BLAS cannot be loaded. Maybe I should open another thread to discuss
that.

Thanks,
David

On Sun, 22 Mar 2015 10:38 Xi Shen davidshe...@gmail.com wrote:

 In the log, I saw

   MemoryStorage: MemoryStore started with capacity 6.7GB

 But I still can not find where to set this storage capacity.

 On Sat, 21 Mar 2015 20:30 Xi Shen davidshe...@gmail.com wrote:

 Hi Sean,

 It's getting strange now. If I ran from IDE, my executor memory is always
 set to 6.7G, no matter what value I set in code. I have check my
 environment variable, and there's no value of 6.7, or 12.5

 Any idea?

 Thanks,
 David

 On Tue, 17 Mar 2015 00:35 null jishnu.prat...@wipro.com wrote:

  Hi Xi Shen,

 You could set spark.executor.memory in the code itself: new
 SparkConf().set("spark.executor.memory", "2g")

 Or you can try the -- spark.executor.memory 2g while submitting the jar.



 Regards

 Jishnu Prathap



 *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
 *Sent:* Monday, March 16, 2015 2:06 PM
 *To:* Xi Shen
 *Cc:* user@spark.apache.org
 *Subject:* Re: How to set Spark executor memory?



 By default spark.executor.memory is set to 512m, I'm assuming since you
 are submiting the job using spark-submit and it is not able to override the
 value since you are running in local mode. Can you try it without using
 spark-submit as a standalone project?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote:

 I set it in code, not by configuration. I submit my jar file to local. I
 am working in my developer environment.



 On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you setting it? and how are you submitting the job?


   Thanks

 Best Regards



 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,



 I have set spark.executor.memory to 2048m, and in the UI Environment
 page, I can see this value has been set correctly. But in the Executors
 page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
 value. why not 256MB, or just as what I set?



 What am I missing here?





 Thanks,

 David










How to do nested foreach with RDD

2015-03-21 Thread Xi Shen
Hi,

I have two big RDDs, and I need to do some math against each pair of them.
Traditionally, this is like a nested for-loop. But with RDDs, it causes a nested
RDD, which is prohibited.

Currently, I am collecting one of them and then doing a nested for-loop, to
avoid the nested RDD. But I would like to know if there's a Spark way to do this.


Thanks,
David


Suggestion for user logging

2015-03-16 Thread Xi Shen
Hi,

When you submit a jar to the Spark cluster, it is very difficult to see the
logging. Is there any way to save the logging to a file? I mean only the
logging I created, not Spark's own log output.
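
One way to get at this, sketched under the assumption that log4j 1.x is the logging backend (as it is for Spark 1.x) and with a hypothetical logger name and file: attach a file appender to your own logger on the driver. Anything logged inside executor-side closures still ends up in the executor logs, not in this file.

import org.apache.log4j.{FileAppender, Logger, SimpleLayout}

// hypothetical logger name and output file; Spark's own loggers are left untouched
val myLog = Logger.getLogger("com.example.myapp")
myLog.addAppender(new FileAppender(new SimpleLayout(), "myapp.log"))
myLog.info("this line goes to myapp.log on the driver")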


Thanks,
David


Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
Hi Sean,

My system is Windows 64-bit. I looked into the Resource Manager: Java is
the only process using CPU, at about 13%; there is no disk activity related
to Java; and only about 6GB of memory is used out of 56GB in total.

My system responds very well. I don't think it is a system issue.

Thanks,
David

On Mon, 16 Mar 2015 22:30 Sean Owen so...@cloudera.com wrote:

 I think you'd have to say more about "stopped working". Is the GC
 thrashing? Does the UI respond? Is the CPU busy or not?

 On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen davidshe...@gmail.com wrote:
  Hi,
 
  I am running k-means using Spark in local mode. My data set is about 30k
  records, and I set the k = 1000.
 
  The algorithm starts and finished 13 jobs according to the UI monitor,
 then
  it stopped working.
 
  The last log I saw was:
 
  [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
  broadcast 16
 
  There're many similar log repeated, but it seems it always stop at the
 16th.
 
  If I try to low down the k value, the algorithm will terminated. So I
 just
  want to know what's wrong with k=1000.
 
 
  Thanks,
  David
 



Can I start multiple executors in local mode?

2015-03-16 Thread Xi Shen
Hi,

In YARN mode you can specify the number of executors. I wonder if we can
also start multiple executors locally, just to make tests run faster.

Thanks,
David


How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi,

I have set spark.executor.memory to 2048m, and in the UI Environment
page, I can see this value has been set correctly. But in the Executors
page, I see there's only 1 executor and its memory is 265.4MB. Very strange
value. Why not 256MB, or just what I set?

What am I missing here?


Thanks,
David


Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
I used local[*]. The CPU hits about 80% when there are active jobs, then
it drops to about 13% and hangs for a very long time.

Thanks,
David

On Mon, 16 Mar 2015 17:46 Akhil Das ak...@sigmoidanalytics.com wrote:

 How many threads are you allocating while creating the SparkContext? E.g.
 local[4] will allocate 4 threads. You can try increasing it to a higher
 number; also try setting the level of parallelism to a higher number.

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I am running k-means using Spark in local mode. My data set is about 30k
 records, and I set the k = 1000.

 The algorithm starts and finished 13 jobs according to the UI monitor,
 then it stopped working.

 The last log I saw was:

 [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
 broadcast *16*

 There're many similar log repeated, but it seems it always stop at the
 16th.

 If I try to low down the *k* value, the algorithm will terminated. So I
 just want to know what's wrong with *k=1000*.


 Thanks,
 David





Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set it in code, not by configuration. I submit my jar file locally. I am
working in my development environment.

On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you setting it? and how are you submitting the job?

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have set spark.executor.memory to 2048m, and in the UI Environment
 page, I can see this value has been set correctly. But in the Executors
 page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
 value. why not 256MB, or just as what I set?

 What am I missing here?


 Thanks,
 David





Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi Akhil,

Yes, you are right. If I run the program from the IDE as a normal Java program,
the executor's memory is increased...but not to 2048m; it is set to
6.7GB... Looks like there's some formula that calculates this value.


Thanks,
David


On Mon, Mar 16, 2015 at 7:36 PM Akhil Das ak...@sigmoidanalytics.com
wrote:

 By default spark.executor.memory is set to 512m, I'm assuming since you
 are submiting the job using spark-submit and it is not able to override the
 value since you are running in local mode. Can you try it without using
 spark-submit as a standalone project?

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote:

 I set it in code, not by configuration. I submit my jar file to local. I
 am working in my developer environment.

 On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you setting it? and how are you submitting the job?

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have set spark.executor.memory to 2048m, and in the UI Environment
 page, I can see this value has been set correctly. But in the Executors
 page, I saw there's only 1 executor and its memory is 265.4MB. Very strange
 value. why not 256MB, or just as what I set?

 What am I missing here?


 Thanks,
 David






Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
I set spark.executor.memory to 2048m. If the executor storage memory is
0.6 of executor memory, it should be 2g * 0.6 = 1.2g.

My machine has 56GB memory, and 0.6 of that should be 33.6G...I hate math xD
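
For reference, the odd-looking numbers line up with the Spark 1.x static memory model, assuming the default spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9 (so the UI reports maxHeap x 0.54, not x 0.6). A back-of-the-envelope check:

// Spark 1.x static memory model: reported storage capacity ~ maxHeap * 0.6 * 0.9
val fraction = 0.6 * 0.9
println(491.5 * fraction) // ~265.4 MB: a 512m heap as the JVM actually reports it
println(6.7 / fraction)   // ~12.4 GB: the max heap implied by the "6.7GB" MemoryStore line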


On Mon, Mar 16, 2015 at 7:59 PM Akhil Das ak...@sigmoidanalytics.com
wrote:

 How much memory do you have on your machine? I think the default value is
 0.6 of spark.executor.memory, as you can see here:
 http://spark.apache.org/docs/1.2.1/configuration.html#execution-behavior

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 2:26 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi Akhil,

 Yes, you are right. If I ran the program from IDE as a normal java
 program, the executor's memory is increased...but not to 2048m, it is set
 to 6.7GB...Looks like there's some formula to calculate this value.


 Thanks,
 David


 On Mon, Mar 16, 2015 at 7:36 PM Akhil Das ak...@sigmoidanalytics.com
 wrote:

 By default spark.executor.memory is set to 512m, I'm assuming since you
 are submiting the job using spark-submit and it is not able to override the
 value since you are running in local mode. Can you try it without using
 spark-submit as a standalone project?

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote:

 I set it in code, not by configuration. I submit my jar file to local.
 I am working in my developer environment.

 On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you setting it? and how are you submitting the job?

 Thanks
 Best Regards

 On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com
 wrote:

 Hi,

 I have set spark.executor.memory to 2048m, and in the UI
 Environment page, I can see this value has been set correctly. But in 
 the
 Executors page, I saw there's only 1 executor and its memory is 
 265.4MB.
 Very strange value. why not 256MB, or just as what I set?

 What am I missing here?


 Thanks,
 David







k-means hang without error/warning

2015-03-15 Thread Xi Shen
Hi,

I am running k-means using Spark in local mode. My data set is about 30k
records, and I set the k = 1000.

The algorithm started and finished 13 jobs according to the UI monitor, then
it stopped working.

The last log I saw was:

[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
broadcast *16*

There are many similar log lines repeated, but it seems it always stops at the 16th.

If I lower the *k* value, the algorithm terminates. So I
just want to know what's wrong with *k=1000*.


Thanks,
David


Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hi,

I read this document,
http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried
to build a TF-IDF model of my documents.

I have a list of documents; each word is represented as an Int, and each
document is listed on one line.

doc_name, int1, int2...
doc_name, int3, int4...

This is how I load my documents:
val documents: RDD[Seq[Int]] = sc.objectFile[(String,
Seq[Int])](s"$sparkStore/documents").map(_._2).cache()

Then I did:

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

I write the tfidf vectors to a text file to try to understand the structure:
FileUtils.writeLines(new File("tfidf.out"),
tfidf.collect().toList.asJavaCollection)

What I see is something like:

(1048576,[0,4,7,8,10,13,17,21],[...some float numbers...])
...

I think it is a tuple with 3 elements.

   - I have no idea what the 1st element is...
   - I think the 2nd element is a list of the words
   - I think the 3rd element is a list of the tf-idf values of the words in the
   previous list

Please help me understand this structure.


Thanks,
David


Re: Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hey, I worked it out myself :)

The Vector is actually a SparseVector, so when it is written out as a
string, the format is

(size, [coordinate], [value...])


Simple!
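
To read those pieces back programmatically rather than from the printed form, a minimal sketch (assuming tfidf is the RDD[Vector] built in the code below) is to pattern-match on SparseVector, which exposes size, indices, and values directly:

import org.apache.spark.mllib.linalg.SparseVector

tfidf.take(3).foreach {
  case sv: SparseVector =>
    // size = dimensionality of the hashing space, indices = term positions, values = tf-idf weights
    println(s"size=${sv.size} indices=${sv.indices.mkString(",")} values=${sv.values.mkString(",")}")
  case other =>
    println(other)
}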


On Sat, Mar 14, 2015 at 6:05 PM Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I read this document,
 http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and
 tried to build a TF-IDF model of my documents.

 I have a list of documents, each word is represented as a Int, and each
 document is listed in one line.

 doc_name, int1, int2...
 doc_name, int3, int4...

 This is how I load my documents:
 val documents: RDD[Seq[Int]] = sc.objectFile[(String,
 Seq[Int])](s"$sparkStore/documents").map(_._2).cache()

 Then I did:

 val hashingTF = new HashingTF()
 val tf: RDD[Vector] = hashingTF.transform(documents)
 val idf = new IDF().fit(tf)
 val tfidf = idf.transform(tf)

 I write the tfidf model to a text file and try to understand the structure.
 FileUtils.writeLines(new File(tfidf.out),
 tfidf.collect().toList.asJavaCollection)

 What I see is something like:

 (1048576,[0,4,7,8,10,13,17,21],[...some float numbers...])
 ...

 I think it is a tuple with 3 elements.

- I have no idea what the 1st element is...
- I think the 2nd element is a list of the word
- I think the 3rd element is a list of tf-idf value of the words in
the previous list

 Please help me understand this structure.


 Thanks,
 David






How to do spares vector product in Spark?

2015-03-13 Thread Xi Shen
Hi,

I have two RDD[Vector]; both Vectors are sparse and of the form:

(id, value)

id indicates the position of the value in the vector space. I want to
apply a dot product to two such RDD[Vector]s and get a scalar value. The
non-existent values are treated as zero.

Is there any convenient tool to do this in Spark?
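
One Spark-native way to express this, sketched under the assumption that each sparse vector is held as an RDD of (coordinate, value) pairs (the names below are hypothetical): join the two RDDs on the coordinate id, multiply the matched values, and sum. Coordinates missing from either side are dropped by the inner join, which matches treating them as zero.

import org.apache.spark.SparkContext._ // pair-RDD and numeric-RDD implicits on Spark 1.x

val v1 = sc.parallelize(Seq(1 -> 2.0, 5 -> 3.0)) // toy (coordinate id, value) pairs
val v2 = sc.parallelize(Seq(1 -> 4.0, 7 -> 1.0))
val dot = v1.join(v2).map { case (_, (a, b)) => a * b }.sum() // 2.0 * 4.0 = 8.0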


Thanks,
David


How to use the TF-IDF model?

2015-03-09 Thread Xi Shen
Hi,

I read this page:
http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am
wondering, how do I use this TF-IDF RDD? What does this TF-IDF vector look
like?

Can someone provide me some guide?


Thanks,


Xi Shen
http://about.me/davidshen


How to load my ML model?

2015-03-09 Thread Xi Shen
Hi,

I used the method from this page,
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/train.html,
to save my k-means model.

But now, I have no idea how to load it back...I tried

sc.objectFile("/path/to/data/file/directory/")


But I got this error:

org.apache.spark.SparkDriverExecutionException: Execution error
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:997)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:14
17)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1339)
at
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1339)
at
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:993)
... 12 more

Any suggestions?
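
If the model was saved the way that guide does it, i.e. the cluster centers written out as an object file of Vectors, then the load side needs the element type given explicitly and the model rebuilt from the centers. A sketch under that assumption (the ArrayStoreException above is consistent with objectFile being called without its element type):

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// give objectFile its element type, collect the centers, and rebuild the model
val centers = sc.objectFile[Vector]("/path/to/data/file/directory/").collect()
val model = new KMeansModel(centers)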


Thanks,

Xi Shen
http://about.me/davidshen


Re: How to reuse a ML trained model?

2015-03-08 Thread Xi Shen
errr...do you have any suggestions for me before the 1.3 release?

I can't believe there's no ML model serialization method in Spark. I think
training the models is quite expensive, isn't it?


Thanks,
David


On Sun, Mar 8, 2015 at 5:14 AM Burak Yavuz brk...@gmail.com wrote:

 Hi,

 There is model import/export for some of the ML algorithms on the current
 master (and they'll be shipped with the 1.3 release).

 Burak
 On Mar 7, 2015 4:17 AM, Xi Shen davidshe...@gmail.com wrote:

 Wait...it seems SparkContext does not provide a way to save/load plain
 objects. It can only save/load RDDs. What did I miss here?


 Thanks,
 David


 On Sat, Mar 7, 2015 at 11:05 PM Xi Shen davidshe...@gmail.com wrote:

 Ah~it is serializable. Thanks!


 On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy ekremak...@gmail.com
 wrote:

 You can serialize your trained model to persist somewhere.

 Ekrem Aksoy

 On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I checked a few ML algorithms in MLLib.

 https://spark.apache.org/docs/0.8.1/api/mllib/index.html#
 org.apache.spark.mllib.classification.LogisticRegressionModel

 I could not find a way to save the trained model. Does this means I
 have to train my model every time? Is there a more economic way to do 
 this?

 I am thinking about something like:

 model.run(...)
 model.save(hdfs://path/to/hdfs)

 Then, next I can do:

 val model = Model.createFrom(hdfs://...)
 model.predict(vector)

 I am new to spark, maybe there are other ways to persistent the model?


 Thanks,
 David





Re: How to reuse a ML trained model?

2015-03-07 Thread Xi Shen
Ah~it is serializable. Thanks!
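
Spelled out, Ekrem's suggestion comes down to plain Java serialization of the model object on the driver; a minimal sketch, with a hypothetical file name and assuming model is the trained (serializable) KMeansModel:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.clustering.KMeansModel

// save the trained model (file name is hypothetical)
val out = new ObjectOutputStream(new FileOutputStream("kmeans.model"))
out.writeObject(model)
out.close()

// ...later, load it back and use it for prediction
val in = new ObjectInputStream(new FileInputStream("kmeans.model"))
val restored = in.readObject().asInstanceOf[KMeansModel]
in.close()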


On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy ekremak...@gmail.com wrote:

 You can serialize your trained model to persist somewhere.

 Ekrem Aksoy

 On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I checked a few ML algorithms in MLLib.


 https://spark.apache.org/docs/0.8.1/api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

 I could not find a way to save the trained model. Does this means I have
 to train my model every time? Is there a more economic way to do this?

 I am thinking about something like:

 model.run(...)
 model.save(hdfs://path/to/hdfs)

 Then, next I can do:

 val model = Model.createFrom(hdfs://...)
 model.predict(vector)

 I am new to spark, maybe there are other ways to persistent the model?


 Thanks,
 David





How to reuse a ML trained model?

2015-03-07 Thread Xi Shen
Hi,

I checked a few ML algorithms in MLLib.

https://spark.apache.org/docs/0.8.1/api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

I could not find a way to save the trained model. Does this mean I have to
train my model every time? Is there a more economical way to do this?

I am thinking about something like:

model.run(...)
model.save(hdfs://path/to/hdfs)

Then, next I can do:

val model = Model.createFrom(hdfs://...)
model.predict(vector)

I am new to Spark; maybe there are other ways to persist the model?


Thanks,
David


Re: How to reuse a ML trained model?

2015-03-07 Thread Xi Shen
Wait...it seems SparkContext does not provide a way to save/load plain
objects. It can only save/load RDDs. What did I miss here?


Thanks,
David


On Sat, Mar 7, 2015 at 11:05 PM Xi Shen davidshe...@gmail.com wrote:

 Ah~it is serializable. Thanks!


 On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy ekremak...@gmail.com wrote:

 You can serialize your trained model to persist somewhere.

 Ekrem Aksoy

 On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I checked a few ML algorithms in MLLib.

 https://spark.apache.org/docs/0.8.1/api/mllib/index.html#
 org.apache.spark.mllib.classification.LogisticRegressionModel

 I could not find a way to save the trained model. Does this means I have
 to train my model every time? Is there a more economic way to do this?

 I am thinking about something like:

 model.run(...)
 model.save(hdfs://path/to/hdfs)

 Then, next I can do:

 val model = Model.createFrom(hdfs://...)
 model.predict(vector)

 I am new to spark, maybe there are other ways to persistent the model?


 Thanks,
 David





Re: Spark code development practice

2015-03-05 Thread Xi Shen
Thanks guys, this is very useful :)

@Stephen, I know spark-shell will create an SC for me. But I don't
understand why we still need to do new SparkContext(...) in our code.
Shouldn't we get it from somewhere, e.g. via something like SparkContext.get?

Another question: if I want my Spark code to run on YARN later, how should
I create the SparkContext? Or can I just specify --master yarn on the command
line?
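
For what it's worth, a minimal sketch of a main() that runs directly from the IDE but still defers to whatever --master spark-submit passes in later (the app and object names are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp { // hypothetical app name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyApp")
    // fall back to local[*] only when no master was supplied;
    // spark-submit --master yarn-client fills in spark.master for us
    if (!conf.contains("spark.master")) conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 10).count())
    } finally {
      sc.stop()
    }
  }
}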


Thanks,
David


On Fri, Mar 6, 2015 at 12:38 PM Koen Vantomme koen.vanto...@gmail.com
wrote:

 use the spark-shell command and the shell will open.
 Type :paste, then paste your code, and finish with Ctrl-D.

 open spark-shell:
 sparks/bin
 ./spark-shell

 Verstuurd vanaf mijn iPhone

 Op 6-mrt.-2015 om 02:28 heeft fightf...@163.com fightf...@163.com het
 volgende geschreven:

 Hi,

 You can first set up a Scala IDE to develop and debug your Spark
 program, let's say IntelliJ IDEA or Eclipse.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Xi Shen davidshe...@gmail.com
 *Date:* 2015-03-06 09:19
 *To:* user@spark.apache.org
 *Subject:* Spark code development practice
 Hi,

 I am new to Spark. I see every spark program has a main() function. I
 wonder if I can run the spark program directly, without using spark-submit.
 I think it will be easier for early development and debug.


 Thanks,
 David




spark-shell --master yarn-client fail on Windows

2015-03-05 Thread Xi Shen
Hi,

My HDFS and YARN services are started, and my spark-shell can work in local
mode.

But when I try spark-shell --master yarn-client, a job is created in
YARN, but it fails very soon. The diagnostics are:

Application application_1425559747310_0002 failed 2 times due to AM
Container for appattempt_1425559747310_0002_02 exited with exitCode: 1
For more detailed output, check application tracking page:
http://Xi-Laptop:8088/proxy/application_1425559747310_0002/Then, click on
links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1425559747310_0002_02_01
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: 1 file(s) moved.
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.

And in the AM log, there're something like:

Could not find or load main class
'-Dspark.driver.appUIAddress=http:..my-server:4040'

And it changes from time to time.

It feels like something is not right in YARN.


Thanks,
David


Spark code development practice

2015-03-05 Thread Xi Shen
Hi,

I am new to Spark. I see every Spark program has a main() function. I
wonder if I can run the Spark program directly, without using spark-submit.
I think it would be easier for early development and debugging.


Thanks,
David


Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Hi Sean,

I launched the spark-shell on the same machine where I started the YARN
service, so I don't think ports will be an issue.

I am new to Spark. I checked the HDFS web UI and the YARN web UI. But I
don't know how to check the AM. Can you help?


Thanks,
David


On Tue, Feb 24, 2015 at 8:37 PM Sean Owen so...@cloudera.com wrote:

 I don't think the build is at issue. The error suggests your App Master
 can't be contacted. Is there a network port issue? did the AM fail?

 On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi Arush,

 I got the pre-built package from https://spark.apache.org/downloads.html. When I
 start spark-shell, it prompts:

 Spark assembly has been built with Hive, including Datanucleus jars
 on classpath

 So don't we have a pre-built package with YARN support? If so, how does
 spark-submit work? I checked the YARN log, and the job is really submitted
 and runs successfully.


 Thanks,
 David





 On Tue Feb 24 2015 at 6:35:38 PM Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hi

 Are you sure that you built Spark for YARN? If standalone works, I am not
 sure it is built for YARN.

 Thanks
 Arush
 On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I followed this guide,
 http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to
 start spark-shell with yarn-client

 ./bin/spark-shell --master yarn-client


 But I got

 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkYarnAM@10.0.2.15:38171] has failed, address is now gated 
 for [5000] ms. Reason is: [Disassociated].

 In the spark-shell, and other exceptions in the YARN log. Please see
 http://stackoverflow.com/questions/28671171/spark-shell-cannot-connect-to-yarn
 for more detail.


 However, submitting to this cluster works. Also, spark-shell works
 standalone.


 My system:

 - ubuntu amd64
 - spark 1.2.1
 - yarn from hadoop 2.6 stable


 Thanks,

 Xi Shen
 http://about.me/davidshen


 --

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com