Re: Spark Pi example fails on YARN

2016-10-21 Thread Xi Shen
YARN, it simply means the AM exited with the FINISHED state. I'm guessing it might be related to memory constraints for the container; please check the YARN RM and NM logs to find out more details. Thanks Saisai On Fri, Oct 21,

Re: Spark Pi example fails on YARN

2016-10-20 Thread Xi Shen
16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! From this, I think Spark has difficulty communicating with YARN. You should check your Spark log. On Fri, Oct 21, 2016 at 8:06 AM Li Li wrote: which

Re: Spark Random Forest training costs the same time on YARN as on standalone

2016-10-20 Thread Xi Shen
If you are running locally, I do not see the point of starting 32 executors with 2 cores each. Also, you can check the Spark web console to find out where the time is spent. You may also want to read http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Xi Shen
It is a plain Java IO error; your line is too long. You should alter your JSON schema so each line is a small JSON object. Please do not concatenate all the objects into an array and then write the array on one line. You will have difficulty handling such a super large JSON array in Spark anyway.

Re: Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Okay, thank you. On Mon, Oct 17, 2016 at 5:53 PM Sean Owen <so...@cloudera.com> wrote: You can take the "with user-provided Hadoop" binary from the download page, and yes, that should mean it does not drag in a Hive dependency of its own. On Mon,

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Xi Shen
I think most of the "big data" tools, like Spark and Hive, are not designed to edit data; they are designed to query data. I wonder in what scenario you need to update a large volume of data repeatedly. On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot wrote: > If my

Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Hi, I want to configure my Hive to use Spark 2 as its engine. According to Hive's instructions, Spark should be built *without* Hadoop or Hive. I could build my own, but for some reason I hope I could use an official binary build. So I want to ask if the official Spark binary build labeled

Hi,

2016-08-21 Thread Xi Shen
I found there are several .conf files in the conf directory. Which one is used as the default when I click the "new" button on the notebook homepage? I want to edit the default profile configuration so all my notebooks are created with custom settings. -- Thanks, David S.

How to implement an InputDStream like the Twitter stream in Spark?

2016-08-17 Thread Xi Shen
Hi, First, I am not sure if I should inherit from InputDStream or ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a receiver on each worker node? If I want to inherit InputDStream, what should I do in the compute() method? -- Thanks, David S.

Re: Why is KMeans with mllib so slow?

2016-03-12 Thread Xi Shen
Hi Chitturi, Please check out https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int). I think it is caused by the initialization step; the "kmeans||" method does not initialize the dataset in parallel. If your dataset is large, it takes

Spark job concurrency problem

2015-05-04 Thread Xi Shen
use them to estimate the progress of the map operation. Looking at the log, it feels like the jobs are done one by one sequentially, rather than in batches of #cpu at a time. I checked the worker nodes, and their CPUs are all busy.

IOUtils cannot write anything in Spark?

2015-04-23 Thread Xi Shen
} IOUtils.writeLines(lines, System.lineSeparator(), file) } Note, I was using the IOUtils from commons-io, not from the Hadoop package. The result is that all the files are created in my HDFS, but they have no data at all...

Re: kmeans|| in Spark is not really parallel?

2015-04-03 Thread Xi Shen
to produce this problem. Thanks! -Xiangrui On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I have opened a couple of threads asking about the k-means performance problem in Spark. I think I made a little progress. Previously I used the simplest way, KMeans.train(rdd, k

Re: Why does k-means clustering hang for a long time?

2015-03-30 Thread Xi Shen
On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng men...@gmail.com wrote: Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote: Hi Burak, After I added

Re: Why is KMeans with mllib so slow?

2015-03-29 Thread Xi Shen
it. Thanks, David On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz brk...@gmail.com wrote: Hi David, Can you also try with Spark 1.3 if possible? I believe

kmeans|| in Spark is not really parallel?

2015-03-29 Thread Xi Shen
this problem. I hope a Spark developer could comment on this problem and help identify whether it is a bug. Thanks,

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
that long pause. Thanks, David On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen davidshe...@gmail.com wrote: Yes, I have done repartition. I tried to repartition

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
to fit in the driver. Reza On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen davidshe...@gmail.com wrote: I have put more detail of my problem at http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed It would be really appreciated if you could help me take a look

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
of partitions when you load the data to equal the number of executors? If your ETL changes the number of partitions, you can also repartition before calling KMeans. On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have a large data set, and I expect to get 5000

Re: How to deploy binary dependencies to workers?

2015-03-26 Thread Xi Shen
OK, after various testing, I found the native library can be loaded if running in yarn-cluster mode. But I still cannot find out why it won't load when running in yarn-client mode... Thanks, David On Thu, Mar 26, 2015 at 4:21 PM Xi Shen davidshe...@gmail.com wrote: Not of course...all

Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
Hi, When I run k-means clustering with Spark, I get this as the last two lines in the log: 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5 Then it hangs for a long time. There's no active job. The driver machine is

SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
wrong with using the lines() function. Thanks,

k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi, I have a large data set, and I expect to get 5000 clusters. I load the raw data, convert it into DenseVector, then do repartition and cache; finally I give the RDD[Vector] to KMeans.train(). Now the job is running, and data are loaded. But according to the Spark UI, all data are

Re: SparkContext.wholeTextFiles throws not serializable exception

2015-03-26 Thread Xi Shen
I have to use .lines.toArray.toSeq. A little tricky. On Fri, Mar 27, 2015 at 4:41 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I want to load my data in this way

Re: Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
AM Xi Shen davidshe...@gmail.com wrote: How do I get the number of cores that I specified at the command line? I want to use spark.default.parallelism. I have 4 executors, each with 8 cores. According to https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior

Re: Building Spark 1.2 from source requires more dependencies

2015-03-26 Thread Xi Shen
It is brought in by another dependency, so you do not need to specify it explicitly... I think this is what Ted meant. On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia mchett...@rocketfuelinc.com wrote: +spark-dev Yes, the dependencies are there. I guess my question is how come the build is

Re: Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
(numCores) 2) cache data 3) call .count() on data right before k-means 4) try k=500 (even less if possible) Thanks, Burak On Mar 26, 2015 4:15 PM, Xi Shen davidshe...@gmail.com wrote: The code is very simple. val data = sc.textFile(very/large/text/file) map { l => // turn each line

Re: K Means cluster with spark

2015-03-26 Thread Xi Shen
Hi Sandeep, I followed the DenseKMeans example which comes with the Spark package. My total vectors are about 40k, and my k=500. All my code is written in Scala. Thanks, David On Fri, 27 Mar 2015 05:51 sandeep vura sandeepv...@gmail.com wrote: Hi Shen, I am also working on k-means

Re: Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
increases the work in executors. If that's not the case, can you give more info on what Spark version you are using, your setup, and your dataset? Thanks, Burak On Mar 26, 2015 5:10 AM, Xi Shen davidshe...@gmail.com wrote: Hi, When I run k-means clustering with Spark, I get this as the last two

Re: Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
OH, the job I talked about has run for more than 11 hrs without a result... it doesn't make sense. On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote: Hi Burak, My iterations are set to 500. But I think it should also stop once the centroids converge, right? My Spark is 1.2.0

Re: Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
, Mar 27, 2015 at 10:04 AM Burak Yavuz brk...@gmail.com wrote: Can you share the code snippet of how you call k-means? Do you cache the data before k-means? Did you repartition the data? On Mar 26, 2015 4:02 PM, Xi Shen davidshe...@gmail.com wrote: OH, the job I talked about has run for more than

Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-26 Thread Xi Shen
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8 cores... the magic number in the bug. On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das ak

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
Not of course... all machines in HDInsight are Windows 64-bit servers. And I have made sure all my DLLs are for 64-bit machines. I have managed to get those DLLs loaded on my local machine, which is also Windows 64-bit.

How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Xi Shen
the application log, it is miles long, and this is the only exception I found. And it is not very useful to help me pinpoint the problem. Any idea what would be the cause? Thanks,

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Xi Shen
What is your environment? I remember I had a similar error when running spark-shell --master yarn-client in a Windows environment. On Wed, Mar 25, 2015 at 9:07 PM sachin Singh sachin.sha...@gmail.com wrote: Hi, when I am submitting a Spark job in cluster mode I am getting the error below in hadoop-yarn

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
, Mar 24, 2015 at 4:13 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am doing ML using Spark MLlib. However, I do not have full control of the cluster; I am using Microsoft Azure HDInsight. I want to deploy BLAS or whatever required dependencies to accelerate the computation. But I

How to deploy binary dependencies to workers?

2015-03-24 Thread Xi Shen
Hi, I am doing ML using Spark MLlib. However, I do not have full control of the cluster; I am using Microsoft Azure HDInsight. I want to deploy BLAS or whatever dependencies are required to accelerate the computation. But I don't know how to deploy those DLLs when I submit my JAR to the cluster.

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-23 Thread Xi Shen
option? spark-submit --driver-library-path /opt/hadoop/lib/native ... Cheers On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I use the *OpenBLAS* DLL, and have configured my application to work in the IDE. When I start my Spark application from the IntelliJ IDE, I can see

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Xi Shen
/opt/hadoop/lib/native ... Cheers On Sat, Mar 21, 2015 at 4:58 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I use the *OpenBLAS* DLL, and have configured my application to work in the IDE. When I start my Spark application from the IntelliJ IDE, I can see in the log that the native lib is loaded

Re: How to set Spark executor memory?

2015-03-22 Thread Xi Shen
On Sun, Mar 22, 2015 at 2:10 PM Ted Yu yuzhih...@gmail.com wrote: bq. the BLAS native cannot be loaded Have you tried specifying the --driver-library-path option? Cheers On Sat, Mar 21, 2015 at 4:42 PM, Xi Shen davidshe...@gmail.com wrote: Yeah, I think it is harder to troubleshoot the properties

Re: How to do nested foreach with RDD

2015-03-22 Thread Xi Shen
).filter { case (a, b) => a < b } Reza On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have two big RDDs, and I need to do some math against each pair of them. Traditionally, it is like a nested for-loop. But for RDDs, it causes a nested RDD, which is prohibited

Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
: Hi Xi Shen, You could set spark.executor.memory in the code itself: new SparkConf().set("spark.executor.memory", "2g"). Or you can try --conf spark.executor.memory=2g while submitting the jar. Regards Jishnu Prathap *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent

Re: Can I start multiple executors in local mode?

2015-03-21 Thread Xi Shen
2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com: Hi, In YARN mode you can specify the number of executors. I wonder if we can also start multiple executors locally, just to make the test run faster. Thanks, David

Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
In the log, I saw MemoryStore: MemoryStore started with capacity 6.7GB. But I still cannot find where to set this storage capacity. On Sat, 21 Mar 2015 20:30 Xi Shen davidshe...@gmail.com wrote: Hi Sean, It's getting strange now. If I run from the IDE, my executor memory is always set

netlib-java cannot load native lib in Windows when using spark-submit

2015-03-21 Thread Xi Shen
Hi, I use the *OpenBLAS* DLL, and have configured my application to work in the IDE. When I start my Spark application from the IntelliJ IDE, I can see in the log that the native lib is loaded successfully. But if I use *spark-submit* to start my application, the native lib still cannot be loaded. I saw

Re: How to set Spark executor memory?

2015-03-21 Thread Xi Shen
Yeah, I think it is harder to troubleshoot the properties issues in an IDE. But the reason I stick to the IDE is that if I use spark-submit, the native BLAS cannot be loaded. Maybe I should open another thread to discuss that. Thanks, David On Sun, 22 Mar 2015 10:38 Xi Shen davidshe...@gmail.com

How to do nested foreach with RDD

2015-03-21 Thread Xi Shen
Hi, I have two big RDDs, and I need to do some math against each pair of them. Traditionally, it is like a nested for-loop. But for RDDs, it causes a nested RDD, which is prohibited. Currently, I am collecting one of them, then doing a nested for-loop, so as to avoid a nested RDD. But I would like to know if

Suggestion for user logging

2015-03-16 Thread Xi Shen
Hi, When you submit a jar to the Spark cluster, it is very difficult to see the logging. Is there any way to save the logging to a file? I mean only the logging I created, not the Spark log information. Thanks, David

Re: k-means hangs without error/warning

2015-03-16 Thread Xi Shen
On Mon, 16 Mar 2015 22:30 Sean Owen so...@cloudera.com wrote: I think you'd have to say more about "stopped working". Is the GC thrashing? Does the UI respond? Is the CPU busy or not? On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am running k-means using

Can I start multiple executors in local mode?

2015-03-16 Thread Xi Shen
Hi, In YARN mode you can specify the number of executors. I wonder if we can also start multiple executors locally, just to make the test run faster. Thanks, David

How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi, I have set spark.executor.memory to 2048m, and in the UI Environment page, I can see this value has been set correctly. But in the Executors page, I saw there's only 1 executor and its memory is 265.4MB, a very strange value. Why not 256MB, or just what I set? What am I missing here?

Re: k-means hangs without error/warning

2015-03-16 Thread Xi Shen
] will allocate 4 threads. You can try increasing it to a higher number; also try setting the level of parallelism to a higher number. Thanks Best Regards On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am running k-means using Spark in local mode. My data set is about

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
:52 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have set spark.executor.memory to 2048m, and in the UI Environment page, I can see this value has been set correctly. But in the Executors page, I saw there's only 1 executor and its memory is 265.4MB, a very strange value. Why not 256MB

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Regards On Mon, Mar 16, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote: I set it in code, not by configuration. I submit my jar file locally; I am working in my development environment. On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote: How are you setting it? And how

Re: How to set Spark executor memory?

2015-03-16 Thread Xi Shen
memory do you have on your machine? I think the default value is 0.6 of spark.executor.memory, as you can see from http://spark.apache.org/docs/1.2.1/configuration.html#execution-behavior. Thanks Best Regards On Mon, Mar 16, 2015 at 2:26 PM, Xi Shen davidshe...@gmail.com wrote: Hi Akhil

k-means hangs without error/warning

2015-03-15 Thread Xi Shen
Hi, I am running k-means using Spark in local mode. My data set is about 30k records, and I set k = 1000. The algorithm started and finished 13 jobs according to the UI monitor, then it stopped working. The last log I saw was: [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner -

Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hi, I read this document, http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried to build a TF-IDF model of my documents. I have a list of documents; each word is represented as an Int, and each document is listed on one line. doc_name, int1, int2... doc_name, int3, int4...

Re: Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hey, I worked it out myself :) The Vector is actually a SparseVector, so when it is written into a string, the format is (size, [indices], [values]). Simple! On Sat, Mar 14, 2015 at 6:05 PM Xi Shen davidshe...@gmail.com wrote: Hi, I read this document, http://spark.apache.org/docs

How to do sparse vector product in Spark?

2015-03-13 Thread Xi Shen
Hi, I have two RDD[Vector]; both Vectors are sparse and of the form (id, value), where id indicates the position of the value in the vector space. I want to apply a dot product to two such RDD[Vector] and get a scalar value. The non-existent values are treated as zero. Any convenient tool to do

How to use the TF-IDF model?

2015-03-09 Thread Xi Shen
Hi, I read this page, http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am wondering, how do I use this TF-IDF RDD? What does this TF-IDF vector look like? Can someone provide me some guidance? Thanks,

How to load my ML model?

2015-03-09 Thread Xi Shen
) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:993) ... 12 more Any suggestions? Thanks,

Re: How to reuse a trained ML model?

2015-03-08 Thread Xi Shen
import/export for some of the ML algorithms on the current master (and they'll be shipped with the 1.3 release). Burak On Mar 7, 2015 4:17 AM, Xi Shen davidshe...@gmail.com wrote: Wait... it seems SparkContext does not provide a way to save/load object files. It can only save/load RDDs. What did I

Re: How to reuse a trained ML model?

2015-03-07 Thread Xi Shen
Ah~it is serializable. Thanks! On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy ekremak...@gmail.com wrote: You can serialize your trained model to persist somewhere. Ekrem Aksoy On Sat, Mar 7, 2015 at 12:10 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I checked a few ML algorithms in MLLib

How to reuse a trained ML model?

2015-03-07 Thread Xi Shen
Hi, I checked a few ML algorithms in MLlib. https://spark.apache.org/docs/0.8.1/api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel I could not find a way to save the trained model. Does this mean I have to train my model every time? Is there a more economical way

Re: How to reuse a trained ML model?

2015-03-07 Thread Xi Shen
Wait... it seems SparkContext does not provide a way to save/load object files. It can only save/load RDDs. What did I miss here? Thanks, David On Sat, Mar 7, 2015 at 11:05 PM Xi Shen davidshe...@gmail.com wrote: Ah~it is serializable. Thanks! On Sat, Mar 7, 2015 at 10:59 PM Ekrem Aksoy

Re: Spark code development practice

2015-03-05 Thread Xi Shen
, Sun. -- fightf...@163.com *From:* Xi Shen davidshe...@gmail.com *Date:* 2015-03-06 09:19 *To:* user@spark.apache.org *Subject:* Spark code development practice Hi, I am new to Spark. I see every spark program has a main() function. I wonder if I can run

spark-shell --master yarn-client fails on Windows

2015-03-05 Thread Xi Shen
Hi, My HDFS and YARN services are started, and my spark-shell can work in local mode. But when I try spark-shell --master yarn-client, a job can be created at the YARN service, but it fails very soon. The diagnostics are: Application application_1425559747310_0002 failed 2 times due to AM

Spark code development practice

2015-03-05 Thread Xi Shen
Hi, I am new to Spark. I see every Spark program has a main() function. I wonder if I can run the Spark program directly, without using spark-submit. I think it will be easier for early development and debugging. Thanks, David

Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Owen so...@cloudera.com wrote: I don't think the build is at issue. The error suggests your App Master can't be contacted. Is there a network port issue? Did the AM fail? On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen davidshe...@gmail.com wrote: Hi Arush, I got the pre-built from https