data type transform when creating an RDD object

2016-02-17 Thread Lin, Hao
Hi,

Quick question on data type transformation when creating an RDD object.

I want to create a Person object with "name" and "DOB" (date of birth):
case class Person(name: String, DOB: java.sql.Date)

Then I want to create an RDD from a text file without the header, i.e. with columns "name"
and "DOB". I have a problem with the following expression, because p(1) must be a
java.sql.Date as previously defined in the case class:

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1)))
  .toDF()

<console>:36: error: type mismatch;
 found   : String
 required: java.sql.Date

So how do I convert p(1) in the above expression to java.sql.Date?
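
For example, would something like the following be the right way to do it? (This is just a 
sketch on my side, assuming the DOB column is formatted as yyyy-MM-dd, which is what 
java.sql.Date.valueOf expects.)

// sketch, assuming DOB strings look like "1980-01-31" (yyyy-MM-dd)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), java.sql.Date.valueOf(p(1).trim)))
  .toDF()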

Any help will be appreciated.

thanks



SSE in s3

2016-02-12 Thread Lin, Hao
Hi,

Can we configure Spark to enable SSE (Server-Side Encryption) when saving files 
to S3?
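
For context, here is a sketch of what I have in mind (the property name is an assumption 
based on the Hadoop S3A connector documentation, and the bucket/path and RDD name are 
placeholders; I have not verified it against our Hadoop version):

// sketch: set the (assumed) server-side-encryption property on the Hadoop
// configuration used by Spark before writing to S3
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
myRdd.saveAsTextFile("s3a://my-bucket/my-output/")  // myRdd and the path are placeholders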

much appreciated!

thanks



sc.textFile the number of the workers to parallelize

2016-02-04 Thread Lin, Hao
Hi,

I have a question on the number of workers that Spark uses to parallelize the 
loading of files with sc.textFile. When I use sc.textFile to access multiple 
files in AWS S3, it seems to use only 2 workers regardless of how many 
worker nodes I have in my cluster. So how does Spark configure the 
parallelization with respect to the number of cluster nodes? In the following case, 
Spark has 896 tasks split between only two nodes, 10.162.97.235 and 
10.162.97.237, while I have 9 nodes in the cluster.

thanks

Example of doing a count:
 scala> s3File.count
16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 
output partitions
16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30)
16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 3.0 KB, free 228.3 KB)
16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 1834.0 B, free 230.1 KB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at 
DAGScheduler.scala:1006
16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 
10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)
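
For reference, here is a minimal sketch of how I can inspect the partition count and 
request a larger minimum number of input partitions (the path is a placeholder); I am 
not sure whether this is the right knob for spreading the tasks across more nodes:

// sketch: check how many partitions the input RDD has, and ask textFile
// for a larger minimum number of partitions if needed
val s3File = sc.textFile("s3n://my-bucket/my-folder/", minPartitions = 896)
println(s3File.partitions.length)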



RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Xiangrui,

For the following problem, I found an issue ticket you posted before:
https://issues.apache.org/jira/browse/HADOOP-10614

I wonder whether this has been fixed in Spark 1.5.2 (I believe it has). Any 
suggestion on how to fix or work around it?
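
In case it helps narrow things down, here is a minimal diagnostic sketch I could run to 
see whether a single object triggers the exception (the file names below are placeholders):

// sketch: count each bz2 file separately to isolate a problematic one
val paths = Seq(
  "s3n://my bucket/myfolder/part-00000.bz2",  // placeholder paths
  "s3n://my bucket/myfolder/part-00001.bz2"
)
paths.foreach { p =>
  println(p + " -> " + sc.textFile(p).count())
}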

Thanks
Hao


From: Lin, Hao [mailto:hao@finra.org]
Sent: Tuesday, February 02, 2016 10:38 AM
To: Robert Collich; user
Subject: RE: try to read multiple bz2 files in s3

Hi Robert,

I just use textFile. Here is the simple code:

val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

Do you suggest I use sc.parallelize instead?

many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao,

Could you please post the corresponding code? Are you using textFile or 
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
When I try to read multiple bz2 files from S3, I get the following warning 
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Robert,

I just use textFile. Here is the simple code:

val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

Do you suggest I use sc.parallelize instead?

many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao,

Could you please post the corresponding code? Are you using textFile or 
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
When I try to read multiple bz2 files from S3, I get the following warning 
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



try to read multiple bz2 files in s3

2016-02-01 Thread Lin, Hao
When I try to read multiple bz2 files from S3, I get the following warning 
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



SPARK_WORKER_INSTANCES deprecated

2016-02-01 Thread Lin, Hao
Can I still use SPARK_WORKER_INSTANCES in conf/spark-env.sh? The following is 
what I got after setting this parameter and running spark-shell:

SPARK_WORKER_INSTANCES was detected (set to '32').
This is deprecated in Spark 1.0+.

Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark 
config.
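
For what it's worth, here is a minimal sketch of the alternative the message points to, 
setting spark.executor.instances on a SparkConf (this assumes YARN mode, and the app 
name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// sketch: request 32 executors via spark.executor.instances instead of the
// deprecated SPARK_WORKER_INSTANCES (the value mirrors the setting above)
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.instances", "32")
val sc = new SparkContext(conf)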



RE: SPARK_WORKER_INSTANCES deprecated

2016-02-01 Thread Lin, Hao
If you look at the Spark documentation, the variable SPARK_WORKER_INSTANCES can still 
be specified, but SPARK_EXECUTOR_INSTANCES is not mentioned:

http://spark.apache.org/docs/1.5.2/spark-standalone.html


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Monday, February 01, 2016 5:45 PM
To: Lin, Hao
Cc: user
Subject: Re: SPARK_WORKER_INSTANCES deprecated

As the message (from SparkConf.scala) showed, you shouldn't use 
SPARK_WORKER_INSTANCES any more.

FYI

On Mon, Feb 1, 2016 at 2:19 PM, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Can I still use SPARK_WORKER_INSTANCES in conf/spark-env.sh? The following is 
what I got after setting this parameter and running spark-shell:

SPARK_WORKER_INSTANCES was detected (set to '32').
This is deprecated in Spark 1.0+.

Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark 
config.




how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Hi,

I have a problem accessing a local file, as in this example:

sc.textFile("file:///root/2008.csv").count()

with the error: File file:/root/2008.csv does not exist.
The file clearly exists, because if I mistype the file name to a non-existing 
one, it shows:

Error: Input path does not exist

Please help!

The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 
10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 
547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hi,

I have a problem accessing a local file, as in this example:

sc.textFile("file:///root/2008.csv").count()

with the error: File file:/root/2008.csv does not exist.
The file clearly exists, because if I mistype the file name to a non-existing 
one, it shows:

Error: Input path does not exist

Please help!

The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 
10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 
547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Yes to your question. I spun up a cluster, logged in to the master as the root 
user, ran spark-shell, and referenced a local file on the master machine.

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:50 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

One more question: are you also running the Spark commands as the root user? 
Meanwhile I am trying to simulate this locally.

On Friday 11 December 2015, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hi,

I have a problem accessing a local file, as in this example:

sc.textFile("file:///root/2008.csv").count()

with the error: File file:/root/2008.csv does not exist.
The file clearly exists, because if I mistype the file name to a non-existing 
one, it shows:

Error: Input path does not exist

Please help!

The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 
10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 
547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(H

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
I logged into the master of my cluster and referenced a local file on the master 
node machine. And yes, that file only resides on the master node, not on any of the 
remote workers.
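
In that case, here is a minimal sketch of the workaround I am considering: copy the file 
into HDFS from the master so that every node can read it through a shared URI (the HDFS 
path is a placeholder):

// sketch: after copying the file into HDFS from the master, e.g.
//   hadoop fs -put /root/2008.csv /data/2008.csv
// read it through a URI that every worker can resolve
val csv = sc.textFile("hdfs:///data/2008.csv")
csv.count()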

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Friday, December 11, 2015 1:00 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Hm, are you referencing a local file from your remote workers? That won't work 
as the file only exists in one machine (I presume).

On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao <hao@finra.org> wrote:
> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
> sc.textFile("file:///root/2008.csv").count()
>
>
>
> with error: File file:/root/2008.csv does not exist.
>
> The file clearly exists since, since if I missed type the file name to 
> an non-existing one, it will show:
>
>
>
> Error: Input path does not exist
>
>
>
> Please help!
>
>
>
> The following is the error message:
>
>
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 
> (TID 498,
> 10.162.167.24): java.io.FileNotFoundException: File 
> file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLoc
> alFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawL
> ocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSyst
> em.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.j
> ava:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(
> ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:3
> 39)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java
> :108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputForm
> at.java:67)
>
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:3
> 8)
>
> at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> ava:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 
> times; aborting job
>
> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 
> in stage 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: 
> File file:/root/2008.csv does not exist
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLoc
> alFileSystem.java:511)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawL
> ocalFileSystem.java:724)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSyst
> em.java:501)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.j
> ava:397)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(
> ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:3
> 39)
>
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java
> :108)
>
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordR

RE: Graph visualization tool for GraphX

2015-12-10 Thread Lin, Hao
Hi Andy, quick question: does Spark Notebook include its own Spark engine, or do I 
need to install Spark separately and point Spark Notebook to it? Thanks

From: Lin, Hao [mailto:hao@finra.org]
Sent: Tuesday, December 08, 2015 7:01 PM
To: andy petrella; Jörn Franke
Cc: user@spark.apache.org
Subject: RE: Graph visualization tool for GraphX

Thanks Andy, I will certainly give your suggestion a try.

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, December 08, 2015 1:21 PM
To: Lin, Hao; Jörn Franke
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Graph visualization tool for GraphX

Hello Lin,

This is indeed a tough scenario when you have many vertices and (even worse) 
many edges...

So, a two-fold answer:
First, technically, there is graph plotting support in the Spark Notebook 
(https://github.com/andypetrella/spark-notebook/ → check this notebook: 
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb).
You can plot a graph from Scala, which will be converted to D3 with a force-layout 
force field.
The number of points that will be plotted is "sampled" using a `Sampler` that you can 
provide yourself. Which leads to the second fold of this answer.

Plotting a large graph is rather tough because there is no real notion of 
dimension... there is always the option to dig into topological analysis theory 
to find a good homeomorphism ... but that won't be very efficient ;-D.
Best is to find a good approach to generalize/summarize the information; there 
are many, many techniques (which you can find mainly in geospatial viz and 
biology viz theory...).
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and then 
plotting a Voronoi diagram (which can be approached using a force layout).

In general terms, I might say that multiscaling is intuitively what you want 
first; this is an interesting paper presenting the foundations: 
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf

Oh, and BTW, to end this longish mail: while looking for new papers on that, I 
came across this one: 
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf
which is using
1. Spark !!!
2. a tile-based approach (~ tiling + pyramids in geospatial)

HTH

PS: regarding the Spark Notebook, you can always come and discuss on gitter: 
https://gitter.im/andypetrella/spark-notebook


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hello Jorn,

Thank you for the reply, and for being tolerant of my oversimplified question. I 
should have been more specific. Though it is ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social network graph, 
like what can be done with Gephi, which has scalability limitations in handling that 
amount of data. There will be dozens of users accessing it, and response time is also 
critical. We would like to run the visualization tool on a remote EC2 server, so a 
web tool would be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com<mailto:jornfra...@gmail.com>]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret ma

Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hi,

Can anyone recommend a great graph visualization tool for GraphX that can 
handle truly large data (~TB)?

Thanks so much
Hao



RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hello Jorn,

Thank you for the reply, and for being tolerant of my oversimplified question. I 
should have been more specific. Though it is ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social network graph, 
like what can be done with Gephi, which has scalability limitations in handling that 
amount of data. There will be dozens of users accessing it, and response time is also 
critical. We would like to run the visualization tool on a remote EC2 server, so a 
web tool would be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization? You have to be more specific: what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualization, how many users, what response time, web 
tool vs. mobile vs. desktop, etc.

On 08 Dec 2015, at 16:46, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hi,

Can anyone recommend a great graph visualization tool for GraphX that can 
handle truly large data (~TB)?

Thanks so much
Hao



RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Thanks Andy, I will certainly give your suggestion a try.

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, December 08, 2015 1:21 PM
To: Lin, Hao; Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

Hello Lin,

This is indeed a tough scenario when you have many vertices and (even worse) 
many edges...

So, a two-fold answer:
First, technically, there is graph plotting support in the Spark Notebook 
(https://github.com/andypetrella/spark-notebook/ → check this notebook: 
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb).
You can plot a graph from Scala, which will be converted to D3 with a force-layout 
force field.
The number of points that will be plotted is "sampled" using a `Sampler` that you can 
provide yourself. Which leads to the second fold of this answer.

Plotting a large graph is rather tough because there is no real notion of 
dimension... there is always the option to dig into topological analysis theory 
to find a good homeomorphism ... but that won't be very efficient ;-D.
Best is to find a good approach to generalize/summarize the information; there 
are many, many techniques (which you can find mainly in geospatial viz and 
biology viz theory...).
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and then 
plotting a Voronoi diagram (which can be approached using a force layout).

In general terms, I might say that multiscaling is intuitively what you want 
first; this is an interesting paper presenting the foundations: 
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf

Oh, and BTW, to end this longish mail: while looking for new papers on that, I 
came across this one: 
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf
which is using
1. Spark !!!
2. a tile-based approach (~ tiling + pyramids in geospatial)

HTH

PS: regarding the Spark Notebook, you can always come and discuss on gitter: 
https://gitter.im/andypetrella/spark-notebook


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hello Jorn,

Thank you for the reply, and for being tolerant of my oversimplified question. I 
should have been more specific. Though it is ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social network graph, 
like what can be done with Gephi, which has scalability limitations in handling that 
amount of data. There will be dozens of users accessing it, and response time is also 
critical. We would like to run the visualization tool on a remote EC2 server, so a 
web tool would be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com<mailto:jornfra...@gmail.com>]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization? You have to be more specific: what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualization, how many users, what response time, web 
tool vs. mobile vs. desktop, etc.

On 08 Dec 2015, at 16:46, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hi,

Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?

2015-12-04 Thread Lin, Hao
Hi,

Does anyone know whether Spark running in AWS supports temporary access 
credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3? I only 
see references to specifying fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, 
without any mention of a security token. Apparently these are only for static 
credentials.
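
For context, here is a sketch of how I am currently passing the static credentials 
(using the property names above; the values and path are placeholders); what I cannot 
find is an equivalent property for the session token:

// sketch: static S3 credentials via the Hadoop configuration used by Spark
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "AKIA...")  // placeholder
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "...")  // placeholder
val data = sc.textFile("s3://my-bucket/my-folder/")            // placeholder path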

Many thanks



RE: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?

2015-12-04 Thread Lin, Hao
Thanks, I will keep an eye on it.

From: Michal Klos [mailto:michal.klo...@gmail.com]
Sent: Friday, December 04, 2015 1:50 PM
To: Lin, Hao
Cc: user
Subject: Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + 
SecurityToken) supported by Spark?

We were looking into this as well --- the answer looks like "no"

Here's the ticket:
https://issues.apache.org/jira/browse/HADOOP-9680

m


On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao 
<hao@finra.org<mailto:hao@finra.org>> wrote:
Hi,

Does anyone know whether Spark running in AWS supports temporary access
credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3?  I only
see references to specifying fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey,
without any mention of a security token. Apparently those are only for static
credentials.

Many thanks




RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-12-02 Thread Lin, Hao
Mich, did you run this locally or on EC2 (I use EC2)?  Is this problem 
universal or specific to, say EC2?  Many thanks

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:01 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable 
error

Hi,

Actually, I went back to 1.3, 1.3.1 and 1.4 and built Spark from source code,
with no luck.

So I am not sure any good result is going to come from backtracking.

Cheers,

Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

From: Lin, Hao [mailto:hao@finra.org]
Sent: 02 December 2015 21:49
To: Mich Talebzadeh <m...@peridale.co.uk<mailto:m...@peridale.co.uk>>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable 
error

I also have the same problem on my side using version 1.5.0. Just wondering if
anyone has any update on this. Should I go back to an earlier version, like
1.4.0?

thanks

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Friday, November 20, 2015 5:43 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: FW: starting spark-shell throws /tmp/hive on HDFS should be writable 
error

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 20 November 2015 21:14
To: u...@hive.apache.org<mailto:u...@hive.apache.org>
Subject: starting spark-shell throws /tmp/hive on HDFS should be writable error

Hi,

Has this been resolved? I don't think this has anything to do with the /tmp/hive
directory permissions.

spark-shell
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See 
http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
/tmp/hive on HDFS should be writable. Current permissions are: rwx--


<console>:10: error: not found: value sqlContext
       import sqlContext.implicits._
              ^
<console>:10: error: not found: value sqlContext
       import sqlContext.sql
              ^

scala>
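
For what it's worth, a commonly suggested workaround (a hedged sketch, not something confirmed in this thread) is to pre-create the scratch directory and loosen its permissions on whichever filesystem /tmp/hive resolves against; from spark-shell this can be done with the Hadoop FileSystem API:

// Hedged sketch: create /tmp/hive and relax its permissions via the Hadoop FileSystem API.
// Assumes a spark-shell session (sc in scope); FileSystem.get picks up whatever default
// filesystem (local or HDFS) the current Hadoop configuration points at.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

val fs = FileSystem.get(sc.hadoopConfiguration)
val scratchDir = new Path("/tmp/hive")
if (!fs.exists(scratchDir)) fs.mkdirs(scratchDir)
// 777 is the permissive setting usually suggested on the list; adjust to your security policy.
fs.setPermission(scratchDir, new FsPermission(Integer.parseInt("777", 8).toShort))
println(fs.getFileStatus(scratchDir).getPermission)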

Thanks,


Mich Talebzadeh


RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-12-02 Thread Lin, Hao
I actually don't have the folder /tmp/hive created on my master node; is that a
problem?

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:40 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable 
error

Locally

thx

Mich Talebzadeh



Re: The processing load of Spark streaming on YARN is not in balance

2015-04-30 Thread Lin Hao Xu

It seems that the data size is only 2.9 MB, far less than the default RDD
size. How about putting more data into Kafka? And how many partitions does
the Kafka topic have?
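
To make that concrete, here is a hedged sketch (not code from the thread) of the usual pattern with the receiver-based Kafka API in Spark Streaming 1.x: start several receivers, union them, and repartition before processing, so the work spreads beyond the single executor that hosts the one receiver. The ZooKeeper address, group id, topic name and counts below are placeholders.

// Hedged sketch: multiple Kafka receivers plus repartition (receiver-based API, Spark Streaming 1.x).
// "zk-host:2181", "my-group" and "my-topic" are placeholder names; tune the counts to your cluster.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val numReceivers = 4  // each receiver occupies one core on one executor
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
}
// Union the per-receiver streams, then repartition so every executor gets processing tasks.
val messages = ssc.union(kafkaStreams).repartition(8).map(_._2)
messages.count().print()

ssc.start()
ssc.awaitTermination()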

Best regards,

Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr: http://www.flickr.com/photos/xulinhao/sets



From:   Kyle Lin kylelin2...@gmail.com
To: user@spark.apache.org user@spark.apache.org
Date:   2015/04/30 14:39
Subject: The processing load of Spark streaming on YARN is not in balance



Hi all

My environment info
Hadoop release version: HDP 2.1
Kafka: 0.8.1.2.1.4.0
Spark: 1.1.0

My question:
    I ran a Spark Streaming program on YARN. My Spark Streaming program reads
data from Kafka and does some processing. But I found there is always only
ONE executor doing the processing. As the following table shows, I had 10
worker executors, but only executor No. 5 is running now, and all RDD blocks
are on executor No. 5. In this situation the processing does not look
distributed. How can I make all executors run together?
Executor ID | Address       | RDD Blocks | Memory Used       | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input  | Shuffle Read | Shuffle Write
1           | slave03:38662 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 58             | 58          | 24.9 s    | 0.0 B  | 1197.0 B     | 440.5 KB
10          | slave05:36992 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 44             | 44          | 30.9 s    | 0.0 B  | 0.0 B        | 501.7 KB
2           | slave02:40250 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 30             | 30          | 19.6 s    | 0.0 B  | 855.0 B      | 1026.0 B
3           | slave01:40882 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 28             | 28          | 20.7 s    | 0.0 B  | 1197.0 B     | 1026.0 B
4           | slave04:57068 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 29             | 29          | 20.9 s    | 0.0 B  | 1368.0 B     | 1026.0 B
5           | slave05:40191 | 23         | 2.9 MB / 265.4 MB | 0.0 B     | 1            | 0            | 3928           | 3929        | 5.1 m     | 2.7 MB | 261.7 KB     | 564.4 KB
6           | slave03:35515 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 1            | 0            | 47             | 48          | 23.1 s    | 0.0 B  | 513.0 B      | 400.4 KB
7           | slave02:40325 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 30             | 30          | 20.2 s    | 0.0 B  | 855.0 B      | 1197.0 B
8           | slave01:48609 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 28             | 28          | 20.5 s    | 0.0 B  | 1363.0 B     | 1026.0 B
(remaining rows truncated in the archive)

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Lin Hao Xu

For your question, I think the discussion in this link can help:

http://apache-spark-user-list.1001560.n3.nabble.com/Error-related-to-serialisation-in-spark-streaming-td6801.html

Best regards,

Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr: http://www.flickr.com/photos/xulinhao/sets



From:   guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk
To: Tathagata Das t...@databricks.com
Cc: user user@spark.apache.org
Date:   2015/04/30 11:17
Subject:Re: Re: implicit function in SparkStreaming



Thanks for your help, it works. I'm curious why the enclosing class cannot be
serialized; does it need to extend java.io.Serializable? If the object is never
serialized, how does it work in the task? And is there any association with
spark.closure.serializer?


 guoqing0...@yahoo.com.hk

 From: Tathagata Das
 Date: 2015-04-30 09:30
 To: guoqing0...@yahoo.com.hk
 CC: user
 Subject: Re: Re: implicit function in SparkStreaming
 Could you put the implicit def in an object? That should work, as objects
 are never serialized.
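
 A hedged sketch of what that restructuring might look like (the object name below is illustrative, not code from the thread):

 // Hedged sketch: keep the implicit conversion in a top-level object so that no
 // enclosing class instance gets pulled into the task closure.
 import scala.language.implicitConversions
 import org.apache.spark.streaming.dstream.DStream

 object MessageConversions {
   implicit def str2KeyValue(s: String): (String, String) = {
     val message = s.split("\\|")
     if (message.length >= 2) (message(0), message(1))
     else if (message.length == 1) (message(0), "")
     else ("", "")
   }
 }

 import MessageConversions._

 // s._1 / s._2 resolve through the implicit conversion above.
 def filter(stream: DStream[String]): DStream[String] =
   stream.filter(s => s._1 == "Action" && s._2 == "TRUE")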


 On Wed, Apr 29, 2015 at 6:28 PM, guoqing0...@yahoo.com.hk 
 guoqing0...@yahoo.com.hk wrote:
   Thank you for your pointers, it's very helpful to me. In this scenario,
   how can I use the implicit def in the enclosing class?



From: Tathagata Das
Date: 2015-04-30 07:00
To: guoqing0...@yahoo.com.hk
CC: user
Subject: Re: implicit function in SparkStreaming
I believe that the implicit def is pulling the enclosing class (in which the
def is defined) into the closure, and that class is not serializable.


On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk 
guoqing0...@yahoo.com.hk wrote:
 Hi guys,
 I'm puzzled about why I can't use an implicit function in Spark Streaming; it
 causes a 'Task not serializable' error.

 code snippet:
 implicit final def str2KeyValue(s: String): (String, String) = {
   val message = s.split("\\|")
   if (message.length >= 2)
     (message(0), message(1))
   else if (message.length == 1) {
     (message(0), "")
   }
   else
     ("", "")
 }

 def filter(stream: DStream[String]): DStream[String] = {
   stream.filter(s => {
     (s._1 == "Action" && s._2 == "TRUE")
   })
 }

 Could you please give me some pointers? Thank you.


  guoqing0...@yahoo.com.hk



Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
1. The full command line is written in a shell script:

LIB=/home/spark/.m2/repository

/opt/spark/bin/spark-submit \
  --class spark.pcap.run.TestPcapSpark \
  --jars $LIB/org/pcap4j/pcap4j-core/1.4.0/pcap4j-core-1.4.0.jar,$LIB/org/pcap4j/pcap4j-packetfactory-static/1.4.0/pcap4j-packetfactory-static-1.4.0.jar,$LIB/org/slf4j/slf4j-api/1.7.6/slf4j-api-1.7.6.jar,$LIB/org/slf4j/slf4j-log4j12/1.7.6/slf4j-log4j12-1.7.6.jar,$LIB/net/java/dev/jna/jna/4.1.0/jna-4.1.0.jar \
  /home/spark/napa/napa.jar

2. We run this script with sudo; if you do not use sudo, you cannot access the
network interface.

3. We also tested List<PcapNetworkInterface> nifs = Pcaps.findAllDevs() in
a standard Java program, and it worked like a champion.
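
As a hedged diagnostic sketch (not from the thread), the same discovery call can be pushed into the executors to see which user the tasks run as and what interfaces they can actually enumerate; it assumes the pcap4j jars from the --jars list above are on the executor classpath:

// Hedged sketch: run pcap4j device discovery inside executor JVMs rather than on the driver.
import scala.collection.JavaConverters._
import scala.util.Try
import org.pcap4j.core.Pcaps

val report = sc.parallelize(1 to 100, 100).mapPartitions { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val user = System.getProperty("user.name")
  val nifs = Try(Pcaps.findAllDevs().asScala.map(_.getName).mkString(","))
    .getOrElse("error calling findAllDevs")
  Iterator(s"$host (user=$user): ${if (nifs.isEmpty) "NO NIFs" else nifs}")
}.distinct().collect()

report.foreach(println)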

Best regards,

Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr: http://www.flickr.com/photos/xulinhao/sets



From:   Dean Wampler deanwamp...@gmail.com
To: Hai Shan Wu/China/IBM@IBMCN
Cc: user user@spark.apache.org, Lin Hao Xu/China/IBM@IBMCN
Date:   2015/04/28 20:07
Subject:Re: A problem of using spark streaming to capture network
packets



It's probably not your code.

What's the full command line you use to submit the job?

Are you sure the job on the cluster has access to the network interface?
Can you test the receiver by itself without Spark? For example, does this
line work as expected:

List<PcapNetworkInterface> nifs = Pcaps.findAllDevs();

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Mon, Apr 27, 2015 at 4:03 AM, Hai Shan Wu wuh...@cn.ibm.com wrote:
  Hi Everyone

  We use pcap4j to capture network packets and then use Spark Streaming to
  analyze the captured packets. However, we ran into a strange problem.

  If we run our application locally (for example, spark-submit
  --master local[2]), the program runs successfully.

  If we run our application on a Spark standalone cluster, the program
  reports that NO NIFs were found.

  I also attach two test files for clarification.

  Can anyone help with this? Thanks in advance!


  (See attached file: PcapReceiver.java)(See attached file:
  TestPcapSpark.java)

  Best regards,

  - Haishan

  Haishan Wu (吴海珊)

  IBM Research - China
  Tel: 86-10-58748508
  Fax: 86-10-58748330
  Email: wuh...@cn.ibm.com
  Lotus Notes: Hai Shan Wu/China/IBM


  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org


Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
BTW, in the Spark web UI the ACL is marked as root.

Best regards,

Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr: http://www.flickr.com/photos/xulinhao/sets



From:   Dean Wampler deanwamp...@gmail.com
To: Lin Hao Xu/China/IBM@IBMCN
Cc: Hai Shan Wu/China/IBM@IBMCN, user user@spark.apache.org
Date:   2015/04/29 09:40
Subject:Re: A problem of using spark streaming to capture network
packets



Are the tasks on the slaves also running as root? If not, that might
explain the problem.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com








Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
Actually, to simplify the problem, we run our program on a single machine
with 4 slave workers. Since it is a single machine, I think all slave workers
run with root privileges.

BTW, if we have a cluster, how do we make sure that slaves on remote machines
run the program as root?

Best regards,

Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr: http://www.flickr.com/photos/xulinhao/sets



From:   Dean Wampler deanwamp...@gmail.com
To: Lin Hao Xu/China/IBM@IBMCN
Cc: Hai Shan Wu/China/IBM@IBMCN, user user@spark.apache.org
Date:   2015/04/29 09:40
Subject:Re: A problem of using spark streaming to capture network
packets



Are the tasks on the slaves also running as root? If not, that might
explain the problem.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com
