Ningjun
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, February 23, 2016 2:30 PM
To: Kevin Mellott
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get progress information of an RDD operation
I think Ningjun was looking for a programmatic way of tracking progress.
I took
How can I get progress information of an RDD operation? For example
val lines = sc.textFile("c:/temp/input.txt") // an RDD with millions of lines
lines.foreach(line => {
handleLine(line)
})
The input.txt file contains millions of lines and the entire operation takes 6 hours. I
want to print out
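A minimal sketch of one way to do this (my own code, not from the thread), using SparkStatusTracker, which has been part of the SparkContext API since Spark 1.2: run the action in a background thread and poll task counts from the driver. handleLine stands in for the poster's per-line work, and sc is the existing SparkContext.

def handleLine(line: String): Unit = { /* per-line work goes here */ }
val lines = sc.textFile("c:/temp/input.txt")

// run the action in a background thread so the driver thread is free to poll progress
val job = new Thread { override def run(): Unit = lines.foreach(line => handleLine(line)) }
job.start()

while (job.isAlive) {
  for (stageId <- sc.statusTracker.getActiveStageIds;
       stage   <- sc.statusTracker.getStageInfo(stageId)) {
    println(s"stage $stageId: ${stage.numCompletedTasks} of ${stage.numTasks} tasks completed")
  }
  Thread.sleep(10000) // report every 10 seconds
}
job.join()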
How can I create an RDD from a SQL query against a SQL Server database? Here is the
example of dataframe
http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" ->
This is a very helpful article. Thanks for the help.
Ningjun
From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Monday, December 07, 2015 12:42 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to create dataframe from SQL Server SQL query
Hi Ningjun,
Haven't done
Sent: Tuesday, October 13, 2015 10:42 AM
To: user@spark.apache.org
Subject: Re: Why is my spark executor is terminated?
Hi Ningjun,
Nothing special in the master log ?
Regards
JB
On 10/13/2015 04:34 PM, Wang, Ningjun (LNG-NPV) wrote:
> We use spark on windows 2008 R2 servers. We use one spark context
> which
We use spark on windows 2008 R2 servers. We use one spark context, which creates
one spark executor. We run the spark master, slave, driver, and executor on a single
machine.
From time to time, we found that the executor JAVA process was terminated. I
cannot figure out why it was terminated. Can
Does anybody have any idea how to solve this problem?
Ningjun
From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, July 30, 2015 11:06 AM
To: user@spark.apache.org
Subject: How to register array class with Kryo in spark-defaults.conf
I register my class with Kryo in spark-defaults.conf as follows:
Sent: Friday, July 31, 2015 11:49 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to register array class with Kryo in spark-defaults.conf
For the second exception, was there anything following the SparkException that
would give us more of a clue?
Can you tell us how EsDoc
I register my class with Kryo in spark-defaults.conf as follows:
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     ltn.analytics.es.EsDoc
But I got the following
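A minimal sketch of a possible fix (an assumption on my part, not confirmed in the thread): with spark.kryo.registrationRequired=true, Kryo also demands that the corresponding array class be registered, since partitions of EsDoc objects are serialized as arrays. One way is to register both classes programmatically on the SparkConf instead of listing them in spark-defaults.conf.

import org.apache.spark.SparkConf
import ltn.analytics.es.EsDoc // the poster's class, used here only for illustration

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[EsDoc], classOf[Array[EsDoc]]))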
Does anybody have any idea what causes this problem? Thanks.
Ningjun
From: Wang, Ningjun (LNG-NPV)
Sent: Wednesday, July 15, 2015 11:09 AM
To: user@spark.apache.org
Subject: java.lang.NoClassDefFoundError: Could not initialize class
org.fusesource.jansi.internal.Kernel32
I just installed spark
I just installed spark 1.3.1 on windows 2008 server. When I start spark-shell,
I got the following error
Failed to created SparkJLineReader: java.lang.NoClassDefFoundError: Could not
initialize class org.fusesource.jansi.internal.Kernel32
Please advise. Thanks.
Ningjun
In rdd.mapPartitions(...), if I try to iterate through the items in the
partition, everything goes wrong. For example:
val rdd = sc.parallelize(1 to 1000, 3)
val count = rdd.mapPartitions(iter => {
println(iter.length)
iter
}).count()
The count is 0. This is incorrect. The count should be 1000. If
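A likely explanation (my reading, not stated in the thread): iter.length consumes the iterator, so Spark gets back an already-exhausted iterator and count() sees no elements. A minimal sketch that materializes the partition before counting it:

val rdd = sc.parallelize(1 to 1000, 3)
val count = rdd.mapPartitions(iter => {
  val items = iter.toArray // materialize so the iterator is not consumed twice
  println(items.length)    // size of this partition
  items.iterator           // hand Spark a fresh iterator over the same elements
}).count()                 // now returns 1000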
I am trying to measure how spark standalone cluster performance scales out with
multiple machines. I ran a test training an SVM model, which is heavy in in-memory
computation. I measured the run time for a spark standalone cluster of 1 - 3
nodes; the results are as follows:
1 node: 35 minutes
2 nodes: 30.1
I think the problem is that in my local etc/hosts file, I have
10.196.116.95 WIN02
I will remove it and try. Thanks for the help.
Ningjun
From: prajod.vettiyat...@wipro.com [mailto:prajod.vettiyat...@wipro.com]
Sent: Friday, June 12, 2015 1:44 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user
I start spark master on windows using
bin\spark-class.cmd org.apache.spark.deploy.master.Master
Then I go to http://localhost:8080/ to find the master URL; it is
spark://WIN02:7077
Here WIN02 is my machine name. Why is it missing the domain name? If I start
the spark master on other
\\10.196.119.230\myshare
Ningjun
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, May 22, 2015 5:02 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: spark on Windows 2008 failed to save RDD to windows shared folder
The stack trace is related to hdfs.
Can you tell
I use a spark standalone cluster on Windows 2008. I keep getting the
following error when trying to save an RDD to a windows shared folder:
rdd.saveAsObjectFile("file:///T:/lab4-win02/IndexRoot01/tobacco-07/myrdd.obj")
15/05/22 16:49:05 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID
...@cloudera.com]
Sent: Thursday, May 21, 2015 11:30 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow
I guess the fundamental issue is that these aren't stored in a way that allows
random access to a Document.
Underneath, Hadoop has a concept of a MapFile
document).
How can I do this quickly? The rdd.sample() method does not help because it
needs to read the entire RDD of 7 million Documents from disk, which takes a very
long time.
Ningjun
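For reference, a minimal sketch of the two sampling calls (neither avoids the full scan described above, which is the real cost): takeSample returns an exact number of elements, while sample takes a fraction (about 70 / 7,000,000 = 1e-5 here). Document is the poster's class.

import org.apache.spark.rdd.RDD

val docs: RDD[Document] = sc.objectFile[Document]("C:/temp/docs.obj")
val exactSample  = docs.takeSample(withReplacement = false, num = 70)              // Array[Document]
val approxSample = docs.sample(withReplacement = false, fraction = 70.0 / 7000000) // RDD[Document]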
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc
Hi
I have an RDD[Document] that contains 7 million objects and is saved in the file
system as an object file. I want to get a random sample of about 70 objects from
it using the rdd.sample() method. It is very slow:
val rdd: RDD[Document] =
  sc.objectFile[Document]("C:/temp/docs.obj").sample(false,
)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Ningjun
From: Jonathan Coveney [mailto:jcove...@gmail.com]
Sent: Wednesday, May 06, 2015 5:23 PM
To: Wang, Ningjun (LNG-NPV)
Cc: Ted Yu; user@spark.apache.org
:32 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: java.io.IOException: org.apache.spark.SparkException: Failed to
get broadcast_2_piece0
Which release of Spark are you using ?
Thanks
On May 6, 2015, at 8:03 AM, Wang, Ningjun (LNG-NPV)
ningjun.w
I run a job on a spark standalone cluster and got the exception below.
Here is the line of code that causes the problem:
val myRdd: RDD[(String, String, String)] = ... // RDD of (docid, category, path)
myRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
val cats: Array[String] = myRdd.map(t =
a DataFrame to an RDD and then invoke reduceByKey
Ningjun
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Thursday, April 30, 2015 3:41 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: RE: HOw can I merge multiple DataFrame and remove duplicated key
1. Do a group by and get
I have multiple DataFrame objects each stored in a parquet file. The DataFrame
just contains 3 columns (id, value, timeStamp). I need to union all the
DataFrame objects together but for duplicated id only keep the record with the
latest timestamp. How can I do that?
I can do this for RDDs
, value2, 2015-01-02
id2, value4, 2015-01-02
I can use reduceByKey() on an RDD, but how do I do it with a DataFrame? Can you give
an example code snippet?
Thanks
Ningjun
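A minimal sketch of the group-by-then-join-back idea (my own code, not ayan's exact answer), where df1, df2, and df3 are hypothetical names for the DataFrames loaded from the parquet files:

import org.apache.spark.sql.functions.max

val all = df1.unionAll(df2).unionAll(df3)
val latest = all.groupBy("id").agg(max("timeStamp").as("maxTs")) // latest timestamp per id
val deduped = all
  .join(latest, all("id") === latest("id") && all("timeStamp") === latest("maxTs"))
  .select(all("id"), all("value"), all("timeStamp"))
// note: if two rows share an id and the same latest timestamp, both survive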
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Wednesday, April 29, 2015 5:54 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user
I have two RDDs, each saved in a parquet file. I need to join these two RDDs by
the id column. Can I create an index on the id column so they join faster?
Here is the code
case class Example(val id: String, val category: String)
case class DocVector(val id: String, val vector: Vector)
val
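There is no secondary index on an RDD, but a minimal sketch of the usual approach (an assumption, not from the thread) is to key both RDDs by id, co-partition them, and join. Here examples and docVectors are hypothetical names for the two RDDs loaded from the parquet files.

import org.apache.spark.HashPartitioner

val byIdExamples = examples.keyBy(_.id).partitionBy(new HashPartitioner(16)).cache()
val byIdVectors  = docVectors.keyBy(_.id).partitionBy(new HashPartitioner(16)).cache()
val joined = byIdExamples.join(byIdVectors) // RDD[(String, (Example, DocVector))]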
I tried to convert an RDD to a data frame using the example code on the spark
website:
case class Person(name: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val people =
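The assignment above is cut off; the full example from the programming guide looks roughly like this (re-typed from memory, so treat the file path as a placeholder):

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")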
Does anybody have a solution for this?
From: Wang, Ningjun (LNG-NPV)
Sent: Tuesday, April 14, 2015 10:41 AM
To: user@spark.apache.org
Subject: How to join RDD keyValuePairs efficiently
I have an RDD that contains millions of Document objects. Each document has a
unique Id that is a string. I
IndexedRDD on the web
https://github.com/amplab/spark-indexedrdd
Has anybody used it?
Ningjun
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 12:18 PM
To: 'Sean Owen'; Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: RE: How
=-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.appDataTtl=seconds
On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Does anybody have an answer for this?
Thanks
Ningjun
From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, April 02
I have an RDD that contains millions of Document objects. Each document has a
unique Id that is a string. I need to find the documents by ids quickly.
Currently I use an RDD join, as follows. First I save the RDD as an object file:
val allDocs: RDD[Document] = getDocs() // this RDD contains 7 million
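One option (an assumption on my part, not the poster's eventual solution): key the documents by id once, hash-partition and persist the pair RDD, and then use lookup(), which only scans the partition that can hold the requested key.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val docsById = allDocs.keyBy(_.id)
  .partitionBy(new HashPartitioner(64))
  .persist(StorageLevel.MEMORY_AND_DISK)

val hits: Seq[Document] = docsById.lookup("some-doc-id") // "some-doc-id" is a placeholder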
Does anybody have an answer for this?
Thanks
Ningjun
From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, April 02, 2015 12:14 PM
To: user@spark.apache.org
Subject: Is the disk space in SPARK_LOCAL_DIRS cleanned up?
I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, spark
writes to this folder. I found that the disk space used by this folder keeps
increasing quickly, and at a certain point I will run out of disk space.
I wonder, does spark clean up the disk space in this folder once the
From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Thursday, March 26, 2015 12:37 PM
To: Sean Owen
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get rdd count() without double evaluation of the RDD?
You can also always take the more extreme approach of using SparkContext
I have an RDD that is expensive to compute. I want to save it as an object file and
also print the count. How can I avoid double computation of the RDD?
val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
val count = rdd.count() // this forces computation of the rdd
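One of the standard options (not necessarily the approach Mark alludes to): persist the RDD so the expensive map runs only once, and both the save and the count reuse the cached data. The output path below is a placeholder.

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile(someFile)
  .map(line => expensiveCalculation(line))
  .persist(StorageLevel.MEMORY_AND_DISK)

rdd.saveAsObjectFile("file:///tmp/expensive-rdd") // computes the RDD once
val count = rdd.count()                           // served from the cache, no recomputation
rdd.unpersist()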
Sent: Wednesday, March 11, 2015 2:40 AM
To: Wang, Ningjun (LNG-NPV)
Cc: java8964; user@spark.apache.org
Subject: Re: sc.textFile() on windows cannot access UNC path
I don't have a complete example for your use case, but you can see a lot of
code showing how to use newAPIHadoopFile from
here: https
Thanks for the suggestion. I will try that.
Ningjun
From: Silvio Fiorito [mailto:silvio.fior...@granturing.com]
Sent: Wednesday, March 11, 2015 12:40 AM
To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: Is it possible to use windows service to start and stop spark
standalone
We are using a spark standalone cluster on Windows 2008 R2. I can start the spark
cluster by opening a command prompt and running the following:
bin\spark-class.cmd org.apache.spark.deploy.master.Master
bin\spark-class.cmd org.apache.spark.deploy.worker.Worker
spark://mywin.mydomain.com:7077
I can stop
I am running Spark on windows 2008 R2. I use sc.textFile() to load a text file
using a UNC path, but it does not work.
sc.textFile(rawfile:10.196.119.230/folder1/abc.txt, 4).count()
Input path does not exist: file:/10.196.119.230/folder1/abc.txt
org.apache.hadoop.mapred.InvalidInputException:
(...)?
Ningjun
From: java8964 [mailto:java8...@hotmail.com]
Sent: Monday, March 09, 2015 5:33 PM
To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: RE: sc.textFile() on windows cannot access UNC path
This is a Java problem, not really Spark.
From this page:
http://stackoverflow.com/questions
I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.
rdd1: (id1, Long text 1), (id2, Long text 2), (id3, Long text 3)
rdd2: (id1, Long text 1 A), (id2, Long text 2 A)
rdd3: (id1, Long text 1 B)
Then, I want to merge all RDDs. If there are duplicate docids, later
is appreciated because I am new to Spark.
Ningjun
From: Boromir Widas [mailto:vcsub...@gmail.com]
Sent: Friday, February 13, 2015 1:28 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to union RDD and remove duplicated keys
reduceByKey should work, but you need to define the ordering
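A minimal sketch of that idea (my own code, assuming the later RDDs should override earlier ones for duplicate docIds): tag each record with the index of the RDD it came from, union them, and keep the record with the highest tag per key.

val tagged = Seq(rdd1, rdd2, rdd3).zipWithIndex.map { case (r, i) =>
  r.mapValues(text => (i, text))
}
val merged = sc.union(tagged)
  .reduceByKey((a, b) => if (a._1 >= b._1) a else b) // later RDD (higher index) wins
  .mapValues(_._2)                                   // back to RDD[(docId, docText)]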
it integrates with our existing
app easily.
Has anybody used spark on windows for a production system? Is spark reliable on
windows?
Ningjun
From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, January 29, 2015 12:53 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: Fail
only use the
local file system and do not have any hdfs file system at all. I don’t
understand why spark generates so many errors about Hadoop while we don’t even need
hdfs.
Ningjun
From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, January 29, 2015 10:45 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user
,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Wednesday, January 28, 2015 5:15 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: Spark on Windows
Has anybody successfully installed and run spark-1.2.0 on windows 2008 R2 or
windows 7? How do you get it to work?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent
I downloaded spark 1.2.0 on my windows server 2008. How do I start the spark master?
I tried to run the following at a command prompt:
C:\spark-1.2.0-bin-hadoop2.4> bin\spark-class.cmd
org.apache.spark.deploy.master.Master
I got the error
else was unexpected at this time.
Ningjun
I downloaded and installed the spark-1.2.0-bin-hadoop2.4.tgz pre-built version on a
Windows 2008 R2 server. When I submit a job using spark-submit, I get the
following error:
WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop
library for your platform
... using builtin-java
Never mind, the problem was that Java was not installed on windows. I installed
Java and the problem went away.
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent
)
However rdd2 contains thousands of partitions instead of 8 partitions
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, January 21, 2015 2:32 PM
To: Wang, Ningjun (LNG-NPV)
Cc
Why does sc.objectFile(...) return an RDD with thousands of partitions?
I save an RDD to the file system using
rdd.saveAsObjectFile("file:///tmp/mydir")
Note that the rdd contains 7 million objects. I checked the directory
/tmp/mydir/; it contains 8 partitions:
part-0 part-2 part-4 part-6
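A possible workaround (an assumption, not the answer given in this thread): request a low minimum number of partitions when reloading, and coalesce if the result is still split too finely.

val rdd2 = sc.objectFile[Document]("file:///tmp/mydir", minPartitions = 8).coalesce(8)
println(rdd2.partitions.length) // should now be 8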
Can anybody answer this? Do I have to have hdfs to achieve this?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
Sent: Friday, January 16, 2015 1:15 PM
To: Imran
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread
Check the number of partitions in your input. It may be much less than the
available parallelism of your small cluster. For example, input that lives in
just 1 partition
I have asked this question before but got no answer. Asking again.
Can I save an RDD to the local file system and then read it back on a spark
cluster with multiple nodes?
rdd.saveAsObjectFile("file:///home/data/rdd1")
val rdd2 =
: Friday, January 16, 2015 9:44 AM
To: Wang, Ningjun (LNG-NPV)
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread
Spark will use the number of cores available in the cluster. If your cluster is
1 node with 4 cores, Spark will execute up
,
Ningjun
From: imranra...@gmail.com [mailto:imranra...@gmail.com] On Behalf Of Imran
Rashid
Sent: Friday, January 16, 2015 12:14 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: Can I save RDD to local file system and then read it back on spark
cluster with multiple nodes?
I'm
I have a standalone spark cluster with only one node with 4 CPU cores. How can
I force spark to do parallel processing of my RDD using multiple threads? For
example I can do the following
spark-submit --master local[4]
However I really want to use the cluster, as follows:
spark-submit --master
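A minimal sketch in the spirit of the replies (check and raise the partition count so all 4 cores get work); the path and target partition count are placeholders:

val rdd = sc.objectFile[Document]("file:///home/data/rdd1")
val parallel = if (rdd.partitions.length < 4) rdd.repartition(4) else rdd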
Can I save an RDD to the local file system and then read it back on a spark
cluster with multiple nodes?
rdd.saveAsObjectFile("file:///home/data/rdd1")
val rdd2 = sc.objectFile("file:///home/data/rdd1")
This works if the cluster has only one node.
/_temporary/attempt_201501120831_0001_m_01_5 which failed.
Has anybody successfully run r.saveAsTextFile(...) to save an RDD to the local file
system on Linux?
Ningjun
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, January 12, 2015 11:25 AM
To: Wang, Ningjun (LNG-NPV
Sent: Monday, January 12, 2015 4:18 AM
To: Wang, Ningjun (LNG-NPV)
Subject: Re: Failed to save RDD as text file to local file system
Have you tried simply giving the path where you want to save the file
I would like to subscribe to the user@spark.apache.org mailing list.
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541
SparkContext.textfile() cannot load file using UNC path on windows
I run the following on Windows XP
val conf = new SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local")
val sc = new SparkContext(conf)