RE: How to get progress information of an RDD operation

2016-02-24 Thread Wang, Ningjun (LNG-NPV)
Ningjun From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, February 23, 2016 2:30 PM To: Kevin Mellott Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: How to get progress information of an RDD operation I think Ningjun was looking for a programmatic way of tracking progress. I took

How to get progress information of an RDD operation

2016-02-23 Thread Wang, Ningjun (LNG-NPV)
How can I get progress information of an RDD operation? For example val lines = sc.textFile("c:/temp/input.txt") // an RDD of millions of lines lines.foreach(line => { handleLine(line) }) The input.txt contains millions of lines. The entire operation takes 6 hours. I want to print out
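Since Spark 1.2 there is a programmatic route via SparkContext.statusTracker; below is a minimal sketch that polls it from a side thread while the action runs. The polling interval is arbitrary and handleLine is the function from the question; treat this as an illustration, not the only way.

import java.util.concurrent.{Executors, TimeUnit}

val lines = sc.textFile("c:/temp/input.txt")

// Poll the status tracker every 30 seconds while the long action runs on the main thread.
val monitor = Executors.newSingleThreadScheduledExecutor()
monitor.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    for (stageId <- sc.statusTracker.getActiveStageIds; info <- sc.statusTracker.getStageInfo(stageId)) {
      println(s"stage $stageId: ${info.numCompletedTasks()} of ${info.numTasks()} tasks done, ${info.numFailedTasks()} failed")
    }
  }
}, 0, 30, TimeUnit.SECONDS)

lines.foreach(line => handleLine(line))   // the long-running action from the question
monitor.shutdown()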

How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
How can I create an RDD from a SQL query against a SQL Server database? Here is the dataframe example http://spark.apache.org/docs/latest/sql-programming-guide.html#overview val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" ->
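For SQL Server specifically, an arbitrary query can be pushed down by wrapping it as a derived table in the dbtable option. A hedged sketch using the Spark 1.4+ DataFrameReader API; the URL, credentials, driver class, and query below are placeholders, not values from the thread:

val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=me;password=secret",  // placeholder connection string
  "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "dbtable" -> "(SELECT id, category FROM documents WHERE category = 'news') AS docs"     // the query, pushed down to SQL Server
)).load()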

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
This is a very helpful article. Thanks for the help. Ningjun From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Monday, December 07, 2015 12:42 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to create dataframe from SQL Server SQL query Hi Ningjun, Haven't done

RE: Why is my spark executor terminated?

2015-10-14 Thread Wang, Ningjun (LNG-NPV)
, October 13, 2015 10:42 AM To: user@spark.apache.org Subject: Re: Why is my spark executor terminated? Hi Ningjun, Nothing special in the master log? Regards JB On 10/13/2015 04:34 PM, Wang, Ningjun (LNG-NPV) wrote: > We use spark on windows 2008 R2 servers. We use one spark context > which

Why is my spark executor terminated?

2015-10-13 Thread Wang, Ningjun (LNG-NPV)
We use spark on windows 2008 R2 servers. We use one spark context which creates one spark executor. We run spark master, slave, driver, and executor on one single machine. From time to time, we found that the executor JAVA process was terminated. I cannot figure out why it was terminated. Can

RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea how to solve this problem? Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, July 30, 2015 11:06 AM To: user@spark.apache.org Subject: How to register array class with Kryo in spark-defaults.conf I register my class with Kryo in spark-defaults.conf as follows

RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
: Friday, July 31, 2015 11:49 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to register array class with Kryo in spark-defaults.conf For the second exception, was there anything following SparkException which would give us more clues? Can you tell us how EsDoc

How to register array class with Kryo in spark-defaults.conf

2015-07-30 Thread Wang, Ningjun (LNG-NPV)
I register my class with Kryo in spark-defaults.conf as follows: spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryo.registrationRequired true spark.kryo.classesToRegister ltn.analytics.es.EsDoc But I got the following
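One workaround, sketched below, is a custom KryoRegistrator, since array classes have awkward JVM names (e.g. [Lltn.analytics.es.EsDoc;) that are hard to list in spark.kryo.classesToRegister. EsDoc is the class from the post; the registrator class name is made up for illustration:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import ltn.analytics.es.EsDoc

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[EsDoc])
    kryo.register(classOf[Array[EsDoc]])   // the array class that registrationRequired complains about
  }
}

// spark-defaults.conf would then point at it:
//   spark.kryo.registrator  MyKryoRegistrator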

RE: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-17 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea what causes this problem? Thanks. Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Wednesday, July 15, 2015 11:09 AM To: user@spark.apache.org Subject: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 I just installed spark

java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-15 Thread Wang, Ningjun (LNG-NPV)
I just installed spark 1.3.1 on windows 2008 server. When I start spark-shell, I got the following error Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 Please advise. Thanks. Ningjun

Cannot iterate items in rdd.mapPartitions()

2015-06-26 Thread Wang, Ningjun (LNG-NPV)
In rdd.mapPartitions(...), if I try to iterate through the items in the partition, everything goes wrong. For example val rdd = sc.parallelize(1 to 1000, 3) val count = rdd.mapPartitions(iter => { println(iter.length) iter }).count() The count is 0. This is incorrect. The count should be 1000. If
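The underlying issue is that the partition iterator is single-pass, so calling iter.length consumes it before Spark gets to count anything. A minimal sketch of buffering the partition first:

val rdd = sc.parallelize(1 to 1000, 3)
val count = rdd.mapPartitions { iter =>
  val items = iter.toArray    // materialize: the iterator can only be traversed once
  println(items.length)       // inspect the buffered copy, not the live iterator
  items.iterator              // hand a fresh iterator back to Spark
}.count()                     // now returns 1000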

Does spark performance really scale out with multiple machines?

2015-06-15 Thread Wang, Ningjun (LNG-NPV)
I try to measure how spark standalone cluster performance scales out with multiple machines. I did a test of training the SVM model, which is heavy in memory computation. I measured the run time for a spark standalone cluster of 1 - 3 nodes; the results are as follows 1 node: 35 minutes 2 nodes: 30.1

RE: How to set spark master URL to contain domain name?

2015-06-12 Thread Wang, Ningjun (LNG-NPV)
I think the problem is that in my local etc/hosts file, I have 10.196.116.95 WIN02 I will remove it and try. Thanks for the help. Ningjun From: prajod.vettiyat...@wipro.com [mailto:prajod.vettiyat...@wipro.com] Sent: Friday, June 12, 2015 1:44 AM To: Wang, Ningjun (LNG-NPV) Cc: user

How to set spark master URL to contain domain name?

2015-06-11 Thread Wang, Ningjun (LNG-NPV)
I start the spark master on windows using bin\spark-class.cmd org.apache.spark.deploy.master.Master Then I go to http://localhost:8080/ to find the master URL; it is spark://WIN02:7077 Here WIN02 is my machine name. Why is it missing the domain name? If I start the spark master on other
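One hedged option is to pass the host explicitly when launching the standalone master, which also determines the URL shown on the web UI; the fully qualified name below is illustrative:

bin\spark-class.cmd org.apache.spark.deploy.master.Master --host WIN02.mydomain.com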

RE: spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-26 Thread Wang, Ningjun (LNG-NPV)
\\10.196.119.230\myshare Ningjun From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, May 22, 2015 5:02 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: spark on Windows 2008 failed to save RDD to windows shared folder The stack trace is related to hdfs. Can you tell

spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-22 Thread Wang, Ningjun (LNG-NPV)
I used a spark standalone cluster on Windows 2008. I kept on getting the following error when trying to save an RDD to a windows shared folder: rdd.saveAsObjectFile("file:///T:/lab4-win02/IndexRoot01/tobacco-07/myrdd.obj") 15/05/22 16:49:05 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
...@cloudera.com] Sent: Thursday, May 21, 2015 11:30 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: rdd.sample() methods very slow I guess the fundamental issue is that these aren't stored in a way that allows random access to a Document. Underneath, Hadoop has a concept of a MapFile

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
document). How can I do this quickly? The rdd.sample() method does not help because it needs to read the entire RDD of 7 million Documents from disk, which takes a very long time. Ningjun From: Sean Owen [mailto:so...@cloudera.com] Sent: Tuesday, May 19, 2015 4:51 PM To: Wang, Ningjun (LNG-NPV) Cc

rdd.sample() methods very slow

2015-05-19 Thread Wang, Ningjun (LNG-NPV)
Hi I have an RDD[Document] that contains 7 million objects and it is saved in the file system as an object file. I want to get a random sample of about 70 objects from it using the rdd.sample() method. It is very slow: val rdd : RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj").sample(false,
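One hedged workaround when only about 70 objects are needed: sample a slightly over-sized fraction and trim with take(), which avoids the extra full count() that takeSample() performs internally. The 7 million figure and the Document class are from the post; the oversampling factor is arbitrary.

import org.apache.spark.rdd.RDD

val total = 7000000L
val rdd: RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj")
val sampled: Array[Document] = rdd
  .sample(withReplacement = false, fraction = 70.0 * 1.5 / total)  // over-sample a little
  .take(70)                                                        // trim to exactly 70 on the driver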

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Wang, Ningjun (LNG-NPV)
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Ningjun From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Wednesday, May 06, 2015 5:23 PM To: Wang, Ningjun (LNG-NPV) Cc: Ted Yu; user@spark.apache.org

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
:32 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0 Which release of Spark are you using ? Thanks On May 6, 2015, at 8:03 AM, Wang, Ningjun (LNG-NPV) ningjun.w

java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
I run a job on a spark standalone cluster and got the exception below. Here is the line of code that causes the problem: val myRdd: RDD[(String, String, String)] = ... // RDD of (docid, category, path) myRdd.persist(StorageLevel.MEMORY_AND_DISK_SER) val cats: Array[String] = myRdd.map(t =

RE: How can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread Wang, Ningjun (LNG-NPV)
a DataFrame to an RDD and then invoke reduceByKey Ningjun From: ayan guha [mailto:guha.a...@gmail.com] Sent: Thursday, April 30, 2015 3:41 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: How can I merge multiple DataFrame and remove duplicated key 1. Do a group by and get

How can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
I have multiple DataFrame objects, each stored in a parquet file. Each DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together, but for duplicated ids only keep the record with the latest timestamp. How can I do that? I can do this for RDDs

RE: How can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread Wang, Ningjun (LNG-NPV)
, value2, 2015-01-02 id2, value4, 2015-01-02 I can use reduceByKey() in an RDD but how do I do it using a DataFrame? Can you give an example code snippet? Thanks Ningjun From: ayan guha [mailto:guha.a...@gmail.com] Sent: Wednesday, April 29, 2015 5:54 PM To: Wang, Ningjun (LNG-NPV) Cc: user
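A hedged DataFrame sketch using only the Spark 1.3 era API, along the lines of the "group by and get max" suggestion: union everything, find the latest timeStamp per id, and join back. df1 and df2 are placeholder names for the frames read from parquet; note that two rows sharing the same id and the same maximum timeStamp would both survive the join.

import org.apache.spark.sql.functions.max

val all    = df1.unionAll(df2)   // each frame has columns (id, value, timeStamp)
val latest = all.groupBy("id").agg(max("timeStamp").as("maxTs")).withColumnRenamed("id", "maxId")
val deduped = all
  .join(latest, all("id") === latest("maxId") && all("timeStamp") === latest("maxTs"))
  .select(all("id"), all("value"), all("timeStamp"))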

Can I index a column in parquet file to make it join faster

2015-04-22 Thread Wang, Ningjun (LNG-NPV)
I have two RDDs, each saved in a parquet file. I need to join these two RDDs by the id column. Can I create an index on the id column so they can join faster? Here is the code case class Example(val id: String, val category: String) case class DocVector(val id: String, val vector: Vector) val

implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Wang, Ningjun (LNG-NPV)
I tried to convert an RDD to a data frame using the example code on the spark website case class Person(name: String, age: Int) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val people =

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
IndexedRDD on the web https://github.com/amplab/spark-indexedrdd Has anybody used it? Ningjun -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 12:18 PM To: 'Sean Owen'; Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: How

RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
=-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=seconds On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Does anybody have an answer for this? Thanks Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, April 02

How to join RDD keyValuePairs efficiently

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I need to find documents by id quickly. Currently I use an RDD join as follows. First I save the RDD as an object file allDocs : RDD[Document] = getDocs() // this RDD contains 7 million
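When only a handful of ids need to be looked up, a hedged alternative to a full join is to broadcast the id set and filter. A sketch, where the id field name and the sample ids are assumptions; allDocs and getDocs() are from the post:

import org.apache.spark.rdd.RDD

val allDocs: RDD[Document] = getDocs()           // the 7-million-document RDD from the post
val wantedIds = Set("doc-00001", "doc-00002")    // illustrative ids to look up
val wanted = sc.broadcast(wantedIds)             // ship the small set to every executor
val found: Array[Document] = allDocs.filter(d => wanted.value.contains(d.id)).collect()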

RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-10 Thread Wang, Ningjun (LNG-NPV)
Does anybody have an answer for this? Thanks Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, April 02, 2015 12:14 PM To: user@spark.apache.org Subject: Is the disk space in SPARK_LOCAL_DIRS cleaned up? I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, spark

Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-02 Thread Wang, Ningjun (LNG-NPV)
I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, spark writes to this folder. I found that the disk space of this folder keeps increasing quickly and at a certain point I will run out of disk space. I wonder, does spark clean up the disk space in this folder once the
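The standalone worker can be told to delete old application directories using the properties mentioned in the reply above; a sketch with example values, set in conf/spark-env.sh (or spark-env.cmd on Windows). Note that this cleans up the worker's directories for finished applications; it is not guaranteed to cover everything written under SPARK_LOCAL_DIRS by a still-running job.

SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"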

RE: How to get rdd count() without double evaluation of the RDD?

2015-03-30 Thread Wang, Ningjun (LNG-NPV)
: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, March 26, 2015 12:37 PM To: Sean Owen Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: How to get rdd count() without double evaluation of the RDD? You can also always take the more extreme approach of using SparkContext

How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that is expensive to compute. I want to save it as an object file and also print the count. How can I avoid double computation of the RDD? val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) val count = rdd.count() // this forces computation of the rdd
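Two common sketches: cache the RDD so the second action reuses the computed partitions, or count with an accumulator during the single save pass. someFile and expensiveCalculation are from the question; the output paths are placeholders, and accumulators can over-count if tasks are retried.

// Option 1: cache, then run both actions against the cached data.
val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)).cache()
rdd.saveAsObjectFile("file:///tmp/rdd-out")    // placeholder path
val count = rdd.count()                        // served from the cache, no recomputation

// Option 2: one pass, counting as a side effect of the save (Spark 1.x accumulator API).
val acc = sc.accumulator(0L)
val rdd2 = sc.textFile(someFile).map { line => acc += 1L; expensiveCalculation(line) }
rdd2.saveAsObjectFile("file:///tmp/rdd-out2")  // placeholder path
val count2 = acc.value                         // read only after the action finishes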

RE: sc.textFile() on windows cannot access UNC path

2015-03-12 Thread Wang, Ningjun (LNG-NPV)
: Wednesday, March 11, 2015 2:40 AM To: Wang, Ningjun (LNG-NPV) Cc: java8964; user@spark.apache.org Subject: Re: sc.textFile() on windows cannot access UNC path I don't have a complete example for your use case, but you can see a lot of code showing how to use newAPIHadoopFile from here https

RE: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Wang, Ningjun (LNG-NPV)
Thanks for the suggestion. I will try that. Ningjun From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] Sent: Wednesday, March 11, 2015 12:40 AM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: Is it possible to use windows service to start and stop spark standalone

Is it possible to use windows service to start and stop spark standalone cluster

2015-03-10 Thread Wang, Ningjun (LNG-NPV)
We are using a spark standalone cluster on Windows 2008 R2. I can start the spark cluster by opening a command prompt and running the following bin\spark-class.cmd org.apache.spark.deploy.master.Master bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://mywin.mydomain.com:7077 I can stop

sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
I am running Spark on windows 2008 R2. I use sc.textFile() to load a text file using a UNC path, but it does not work. sc.textFile(rawfile:10.196.119.230/folder1/abc.txt, 4).count() Input path does not exist: file:/10.196.119.230/folder1/abc.txt org.apache.hadoop.mapred.InvalidInputException:

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
(...)? Ningjun From: java8964 [mailto:java8...@hotmail.com] Sent: Monday, March 09, 2015 5:33 PM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path This is a Java problem, not really Spark. From this page: http://stackoverflow.com/questions

How to union RDD and remove duplicated keys

2015-02-13 Thread Wang, Ningjun (LNG-NPV)
I have multiple RDD[(String, String)]s that store (docId, docText) pairs, e.g. rdd1: (id1, Long text 1), (id2, Long text 2), (id3, Long text 3) rdd2: (id1, Long text 1 A), (id2, Long text 2 A) rdd3: (id1, Long text 1 B) Then, I want to merge all RDDs. If there are duplicated docIds, later
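A sketch that makes "later wins" explicit, regardless of the order in which reduceByKey combines values: tag each pair with the index of the RDD it came from and keep the highest tag. rdd1, rdd2, rdd3 are the RDDs from the question.

import org.apache.spark.SparkContext._   // only needed on Spark < 1.3 for reduceByKey

val tagged = Seq(rdd1, rdd2, rdd3).zipWithIndex.map { case (rdd, i) =>
  rdd.map { case (id, text) => (id, (i, text)) }
}
val merged = tagged.reduce(_ union _)
  .reduceByKey((a, b) => if (a._1 >= b._1) a else b)   // keep the value from the later RDD
  .mapValues(_._2)                                     // back to (docId, docText)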

RE: How to union RDD and remove duplicated keys

2015-02-13 Thread Wang, Ningjun (LNG-NPV)
is appreciated because I am new to Spark. Ningjun From: Boromir Widas [mailto:vcsub...@gmail.com] Sent: Friday, February 13, 2015 1:28 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to union RDD and remove duplicated keys reduceByKey should work, but you need to define the ordering

RE: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Wang, Ningjun (LNG-NPV)
it integrates with our existing app easily. Has anybody used spark on windows for a production system? Is spark reliable on windows? Ningjun From: gen tang [mailto:gen.tan...@gmail.com] Sent: Thursday, January 29, 2015 12:53 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: Fail

RE: Fail to launch spark-shell on windows 2008 R2

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
only use the local file system and do not have any hdfs file system at all. I don’t understand why spark generates so many errors about Hadoop while we don’t even need hdfs. Ningjun From: gen tang [mailto:gen.tan...@gmail.com] Sent: Thursday, January 29, 2015 10:45 AM To: Wang, Ningjun (LNG-NPV) Cc: user

RE: Spark on Windows 2008 R2 server does not work

2015-01-29 Thread Wang, Ningjun (LNG-NPV)
, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Wednesday, January 28, 2015 5:15 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: Spark on Windows

RE: Spark on Windows 2008 R2 server does not work

2015-01-28 Thread Wang, Ningjun (LNG-NPV)
Has anybody successfully installed and run spark-1.2.0 on windows 2008 R2 or windows 7? How do you get it to work? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent

How to start spark master on windows

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
I downloaded spark 1.2.0 on my windows server 2008. How do I start the spark master? I tried to run the following at a command prompt C:\spark-1.2.0-bin-hadoop2.4 bin\spark-class.cmd org.apache.spark.deploy.master.Master I got the error "else was unexpected at this time." Ningjun

Spark on Windows 2008 R2 server does not work

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
I downloaded and installed the spark-1.2.0-bin-hadoop2.4.tgz pre-built version on a Windows 2008 R2 server. When I submit a job using spark-submit, I got the following error WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform ... using builtin-java

RE: How to start spark master on windows

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
Never mind, the problem was that JAVA was not installed on windows. I installed JAVA and the problem went away. Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent

RE: sparkcontext.objectFile returns thousands of partitions

2015-01-22 Thread Wang, Ningjun (LNG-NPV)
) However rdd2 contains thousands of partitions instead of 8 partitions Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, January 21, 2015 2:32 PM To: Wang, Ningjun (LNG-NPV) Cc

sparkcontext.objectFile returns thousands of partitions

2015-01-21 Thread Wang, Ningjun (LNG-NPV)
Why does sc.objectFile(...) return an RDD with thousands of partitions? I save an rdd to the file system using rdd.saveAsObjectFile("file:///tmp/mydir") Note that the rdd contains 7 million objects. I check the directory /tmp/mydir/; it contains 8 partitions part-0 part-2 part-4 part-6
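If the thousands of input splits are just overhead, a hedged fix is to collapse them right after loading. Document stands in for whatever class was stored; the target of 8 partitions matches the original save:

val rdd2 = sc.objectFile[Document]("file:///tmp/mydir").coalesce(8)   // merge the many small splits, no shuffle
println(rdd2.partitions.length)                                       // 8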

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-20 Thread Wang, Ningjun (LNG-NPV)
Can anybody answer this? Do I have to have hdfs to achieve this? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Friday, January 16, 2015 1:15 PM To: Imran

RE: How to force parallel processing of RDD using multiple thread

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to force parallel processing of RDD using multiple thread Check the number of partitions in your input. It may be much less than the available parallelism of your small cluster. For example, input that lives in just 1 partition

Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
I have asked this question before but got no answer. Asking again. Can I save an RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile("file:///home/data/rdd1") val rdd2 =

RE: How to force parallel processing of RDD using multiple thread

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
: Friday, January 16, 2015 9:44 AM To: Wang, Ningjun (LNG-NPV) Cc: Sean Owen; user@spark.apache.org Subject: Re: How to force parallel processing of RDD using multiple thread Spark will use the number of cores available in the cluster. If your cluster is 1 node with 4 cores, Spark will execute up

RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-16 Thread Wang, Ningjun (LNG-NPV)
, Ningjun From: imranra...@gmail.com [mailto:imranra...@gmail.com] On Behalf Of Imran Rashid Sent: Friday, January 16, 2015 12:14 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes? I'm

How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Wang, Ningjun (LNG-NPV)
I have a standalone spark cluster with only one node with 4 CPU cores. How can I force spark to do parallel processing of my RDD using multiple threads? For example I can do the following spark-submit --master local[4] However I really want to use the cluster as follows spark-submit --master
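Whatever master URL is used, parallelism is also capped by the number of partitions in the input (as the reply above about partition counts points out). A sketch of making the input match the 4 cores; the path is illustrative:

val lines = sc.textFile("c:/temp/input.txt", 4)   // ask for at least 4 partitions up front
println(lines.partitions.length)                  // should be >= 4
// for an RDD that already exists with too few partitions, repartition it:
val rebalanced = lines.repartition(4)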

Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-14 Thread Wang, Ningjun (LNG-NPV)
Can I save an RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile("file:///home/data/rdd1") val rdd2 = sc.objectFile("file:///home/data/rdd1") This works if the cluster has only one node.

RE: Failed to save RDD as text file to local file system

2015-01-13 Thread Wang, Ningjun (LNG-NPV)
/_temporary/attempt_201501120831_0001_m_01_5 which failed. Has anybody successfully run r.saveAsTextFile(...) to save an RDD to the local file system on Linux? Ningjun -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, January 12, 2015 11:25 AM To: Wang, Ningjun (LNG-NPV

RE: Failed to save RDD as text file to local file system

2015-01-13 Thread Wang, Ningjun (LNG-NPV)
Sent: Monday, January 12, 2015 4:18 AM To: Wang, Ningjun (LNG-NPV) Subject: Re: Failed to save RDD as text file to local file system Have you tried simply giving the path where you want to save the file

subscribe me to the list

2014-12-05 Thread Wang, Ningjun (LNG-NPV)
I would like to subscribe to user@spark.apache.org Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

SparkContext.textFile() cannot load file using UNC path on windows

2014-11-26 Thread Wang, Ningjun (LNG-NPV)
SparkContext.textFile() cannot load a file using a UNC path on windows. I run the following on Windows XP val conf = new SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local") val sc = new SparkContext(conf)