Data type transformation when creating an RDD object

2016-02-17 Thread Lin, Hao
Hi, quick question on data type transformation when creating an RDD object. I want to create a person object with "name" and DOB (date of birth): case class Person(name: String, DOB: java.sql.Date). Then I want to create an RDD from a text file without the header, e.g. "name" and "DOB". I have
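
A minimal sketch of one way to do that parse (hypothetical layout: comma-separated name and date, header already removed; java.sql.Date.valueOf expects the yyyy-MM-dd form):

    case class Person(name: String, DOB: java.sql.Date)

    // each line assumed to look like "Alice,1980-01-31"
    val people = sc.textFile("people.txt").map { line =>
      val fields = line.split(",")
      Person(fields(0).trim, java.sql.Date.valueOf(fields(1).trim))
    }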

SSE in s3

2016-02-12 Thread Lin, Hao
Hi, can we configure Spark to enable SSE (Server-Side Encryption) when saving files to S3? Much appreciated, thanks.
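
One hedged sketch, assuming the S3A connector and a Hadoop build that supports the property (it requests SSE-S3 with AES256; the older s3n connector may not honor it):

    // set on the Hadoop configuration before writing; myRdd is illustrative
    sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
    myRdd.saveAsTextFile("s3a://my-bucket/output/")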

sc.textFile the number of the workers to parallelize

2016-02-04 Thread Lin, Hao
Hi, I have a question about the number of workers Spark uses to parallelize file loading with sc.textFile. When I use sc.textFile to access multiple files in AWS S3, it seems to use only 2 workers regardless of how many worker nodes I have in my cluster. So how does Spark
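
For what it's worth, textFile accepts a minPartitions hint, so a sketch like this (the number 64 is illustrative) can raise the parallelism when the input format is splittable:

    // ask for at least 64 partitions instead of the small default
    val lines = sc.textFile("s3n://my-bucket/myfolder/", minPartitions = 64)
    // or shuffle an already-loaded RDD across more partitions:
    val spread = lines.repartition(64)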

RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Xiangrui, For the following problem, I found an issue ticket you posted earlier: https://issues.apache.org/jira/browse/HADOOP-10614 I wonder if this has been fixed in Spark 1.5.2, which I believe it has. Any suggestion on how to fix it? Thanks. Hao From: Lin, Hao [mailto:hao@finra.org
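
For context, HADOOP-10614 tracks a thread-safety bug in Hadoop's bzip2 codec (CBZip2InputStream); whether a given Spark 1.5.2 build has the fix depends on the Hadoop version it bundles. A hedged stopgap, assuming the race is between concurrent tasks in one executor JVM:

    import org.apache.spark.SparkConf

    // one concurrent task per executor, so bzip2 decompressors don't race
    // (trades away throughput; a workaround, not a fix)
    val conf = new SparkConf().set("spark.executor.cores", "1")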

RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Robert, I just use textFile. Here is the simple code: val fs3File = sc.textFile("s3n://my bucket/myfolder/"); fs3File.count. Do you suggest I use sc.parallelize instead? Many thanks. From: Robert Collich [mailto:rcoll...@gmail.com] Sent: Monday, February 01, 2016 6:54 PM To: Lin,
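
For context, sc.parallelize distributes an in-memory collection and does not read files, so it would not replace textFile here; a quick sketch of the difference (values illustrative):

    // parallelize: spread a driver-side collection across the cluster
    val nums = sc.parallelize(1 to 1000, numSlices = 8)
    // textFile: read files from S3/HDFS/local storage
    val lines = sc.textFile("s3n://my-bucket/myfolder/")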

try to read multiple bz2 files in s3

2016-02-01 Thread Lin, Hao
When I tried to read multiple bz2 files from S3, I got the following warning messages. What is the problem here? 16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343 at

SPARK_WORKER_INSTANCES deprecated

2016-02-01 Thread Lin, Hao
Can I still use SPARK_WORKER_INSTANCES in conf/spark-env.sh? The following is what I got after setting this parameter and running spark-shell: SPARK_WORKER_INSTANCES was detected (set to '32'). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with --num-executors to
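
The deprecation message points at spark-submit; a hedged example of the suggested replacement (flag values illustrative, and --num-executors applies in YARN mode):

    ./bin/spark-submit --num-executors 32 --executor-cores 1 --class MyApp my-app.jar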

RE: SPARK_WORKER_INSTANCES deprecated

2016-02-01 Thread Lin, Hao
If you look at the Spark doc, the variable SPARK_WORKER_INSTANCES can still be specified, but not yet SPARK_EXECUTOR_INSTANCES: http://spark.apache.org/docs/1.5.2/spark-standalone.html From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Monday, February 01, 2016 5:45 PM To: Lin, Hao Cc: user Subject

how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Hi, I have a problem accessing a local file, for example: sc.textFile("file:///root/2008.csv").count() fails with the error: File file:/root/2008.csv does not exist. The file clearly exists, since if I mistype the file name to a non-existent one, it shows: Error: Input path does not

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Here you go, thanks. -rw-r--r-- 1 root root 658M Dec 9 2014 /root/2008.csv From: Vijay Gharge [mailto:vijay.gha...@gmail.com] Sent: Friday, December 11, 2015 12:31 PM To: Lin, Hao Cc: user@spark.apache.org Subject: Re: how to access local file from Spark sc.textFile("file:///path to/m

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Yes to your question. I spun up a cluster, logged in to the master as the root user, ran spark-shell, and referenced the local file on the master machine. From: Vijay Gharge [mailto:vijay.gha...@gmail.com] Sent: Friday, December 11, 2015 12:50 PM To: Lin, Hao Cc: user@spark.apache.org Subject: Re

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
To: Lin, Hao Cc: user@spark.apache.org Subject: Re: how to access local file from Spark sc.textFile("file:///path to/myfile") Hm, are you referencing a local file from your remote workers? That won't work, as the file only exists on one machine (I presume). On Fri, Dec 11, 2015 at 5:19 PM
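
If the file really lives only on the master, two commonly suggested workarounds (a sketch; paths illustrative):

    // 1) put the file on shared storage first, e.g. (shell, on the master):
    //      hadoop fs -put /root/2008.csv /data/2008.csv
    //    then read it from every node:
    val n = sc.textFile("hdfs:///data/2008.csv").count()

    // 2) or copy the file to the same local path on every worker node,
    //    so file:///root/2008.csv resolves everywhere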

RE: Graph visualization tool for GraphX

2015-12-10 Thread Lin, Hao
Hi Andy, quick question: does Spark Notebook include its own Spark engine, or do I need to install Spark separately and point Spark Notebook to it? Thanks. From: Lin, Hao [mailto:hao@finra.org] Sent: Tuesday, December 08, 2015 7:01 PM To: andy petrella; Jörn Franke Cc: user@spark.apache.org

Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hi, can anyone recommend a good graph visualization tool for GraphX that can handle truly large data (~ TB)? Thanks so much. Hao

RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
specific ☺. Thanks, Hao. From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: Tuesday, December 08, 2015 11:31 AM To: Lin, Hao Cc: user@spark.apache.org Subject: Re: Graph visualization tool for GraphX I am not sure about your use case. How should a human interpret many terabytes of data in one large

RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Thanks Andy, I will certainly give your suggestion a try. From: andy petrella [mailto:andy.petre...@gmail.com] Sent: Tuesday, December 08, 2015 1:21 PM To: Lin, Hao; Jörn Franke Cc: user@spark.apache.org Subject: Re: Graph visualization tool for GraphX Hello Lin, This is indeed a tough

Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?

2015-12-04 Thread Lin, Hao
Hi, does anyone know whether Spark running in AWS supports temporary access credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3? I only see references to specifying fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, without any mention of a security token. Apparently this is only
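
For reference, session-token support arrived later in the S3A connector (Hadoop 2.8+); a hedged sketch of that configuration, which would not apply to the older s3/s3n properties mentioned above:

    val hc = sc.hadoopConfiguration
    hc.set("fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hc.set("fs.s3a.access.key", "ACCESS_KEY_ID")        // placeholders
    hc.set("fs.s3a.secret.key", "SECRET_ACCESS_KEY")
    hc.set("fs.s3a.session.token", "SECURITY_TOKEN")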

RE: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?

2015-12-04 Thread Lin, Hao
Thanks, I will keep an eye on it. From: Michal Klos [mailto:michal.klo...@gmail.com] Sent: Friday, December 04, 2015 1:50 PM To: Lin, Hao Cc: user Subject: Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark? We were looking into this as well

RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-12-02 Thread Lin, Hao
Mich, did you run this locally or on EC2 (I use EC2)? Is this problem universal or specific to, say, EC2? Many thanks. From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: Wednesday, December 02, 2015 5:01 PM To: Lin, Hao; user@spark.apache.org Subject: RE: starting spark-shell throws /tmp

RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-12-02 Thread Lin, Hao
I actually don't have the folder /tmp/hive created on my master node; is that a problem? From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: Wednesday, December 02, 2015 5:40 PM To: Lin, Hao; user@spark.apache.org Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable
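
The commonly cited fix, assuming the error refers to the HDFS path, is to create the directory and open up its permissions:

    hdfs dfs -mkdir -p /tmp/hive
    hdfs dfs -chmod 777 /tmp/hive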

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Lin Hao Xu
It seems that the data size is only 2.9MB, far less than the default RDD size. How about putting more data into Kafka? And what about the number of topic partitions in Kafka? Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinhao
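
A hedged sketch of one way to spread a small input across more tasks (Spark 1.x streaming; kafkaStream stands for the DStream returned by KafkaUtils and is illustrative), though adding Kafka topic partitions is the more direct fix:

    // redistribute each incoming batch across the cluster before heavy processing
    val rebalanced = kafkaStream.repartition(16)
    rebalanced.foreachRDD { rdd => println(rdd.count()) }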

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Lin Hao Xu
For your question, I think the discussion in this link can help: http://apache-spark-user-list.1001560.n3.nabble.com/Error-related-to-serialisation-in-spark-streaming-td6801.html Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
3. We also tested List<PcapNetworkInterface> nifs = Pcaps.findAllDevs() in a standard Java program, and it worked like a champion. Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinhao/sets From: Dean Wampler deanwamp

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
BTW, from the Spark web UI, the ACL is marked as root. Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinhao/sets From: Dean Wampler deanwamp...@gmail.com To: Lin Hao Xu/China/IBM@IBMCN Cc: Hai Shan Wu/China/IBM@IBMCN

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
Actually, to simplify this problem, we run our program on a single machine with 4 slave workers. Since it is a single machine, I think all slave workers run with root privileges. BTW, if we have a cluster, how do we make sure slaves on remote machines run the program as root? Best regards, Lin Hao XU