data type transform when creating an RDD object
Hi, a quick question on data type transformation when creating an RDD object. I want to create a Person object with "name" and DOB (date of birth):

case class Person(name: String, DOB: java.sql.Date)

Then I want to create an RDD from a text file (without the header) whose columns are "name" and "DOB". The following expression fails, because p(1) is a String while the case class declares DOB as java.sql.Date:

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1))).toDF()

<console>:36: error: type mismatch;
 found   : String
 required: java.sql.Date

So how do I convert p(1) in the above expression to java.sql.Date? Any help will be appreciated. Thanks
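A minimal sketch of one way to do the conversion, assuming the DOB column is written as yyyy-[m]m-[d]d (the format java.sql.Date.valueOf expects); the file path and Person definition are the ones from the question above:

import java.sql.Date

case class Person(name: String, DOB: java.sql.Date)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), Date.valueOf(p(1).trim)))   // parse "yyyy-MM-dd" text into a java.sql.Date
  .toDF()

If the dates come in another format, parsing them with java.text.SimpleDateFormat and wrapping the resulting millisecond timestamp in new java.sql.Date(...) works the same way.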
SSE in s3
Hi, can we configure Spark to enable SSE (Server-Side Encryption) when saving files to S3? Much appreciated, thanks.
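For what it's worth, SSE is usually switched on at the Hadoop S3 connector level rather than in Spark itself. A minimal sketch, assuming the s3a connector from hadoop-aws is on the classpath (the property name differs for s3n and may vary across Hadoop versions; the bucket and path are placeholders):

// Ask the S3A connector to request SSE-S3 (AES256) on every object it writes.
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")

// Any subsequent write goes out with server-side encryption requested.
sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("s3a://my-bucket/encrypted-output/")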
sc.textFile and the number of workers used to parallelize
Hi, I have a question about the number of workers Spark uses to parallelize the loading of files with sc.textFile. When I use sc.textFile to read multiple files from AWS S3, it seems to use only 2 workers regardless of how many worker nodes I have in the cluster. How does Spark decide the parallelization relative to the size of the cluster? In the following case, Spark splits 896 tasks between only two nodes, 10.162.97.235 and 10.162.97.237, while I have 9 nodes in the cluster. Thanks.

Example of doing a count:

scala> s3File.count
16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 output partitions
16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30)
16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 228.3 KB)
16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1834.0 B, free 230.1 KB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)
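As a side note, the number of tasks (partitions) and the number of executors are controlled separately. A minimal sketch of the knobs on the RDD side; the path and the numbers are placeholders, and on YARN the executor count itself is set at submit time (for example through spark.executor.instances), not by textFile:

val s3File = sc.textFile("s3n://my-bucket/my-folder/", 896)   // minPartitions hint: how many tasks the read is split into
val spread = s3File.repartition(72)                           // placeholder (e.g. 9 nodes x 8 cores) to spread work across the cluster
spread.count()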
RE: try to read multiple bz2 files in s3
Hi Xiangrui, for the problem below I found an issue ticket you posted earlier: https://issues.apache.org/jira/browse/HADOOP-10614. I wonder whether this has been fixed in Spark 1.5.2, which I believe it has been. Any suggestion on how to fix it? Thanks, Hao

From: Lin, Hao [mailto:hao@finra.org]
Sent: Tuesday, February 02, 2016 10:38 AM
To: Robert Collich; user
Subject: RE: try to read multiple bz2 files in s3

Hi Robert, I just use textFile. Here is the simple code:

val fs3File = sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

Do you suggest I use sc.parallelize instead? Many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao, could you please post the corresponding code? Are you using textFile or sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao <hao@finra.org> wrote:
When I try to read multiple bz2 files from S3, I get the following warning messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
RE: try to read multiple bz2 files in s3
Hi Robert, I just use textFile. Here is the simple code:

val fs3File = sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

Do you suggest I use sc.parallelize instead? Many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao, could you please post the corresponding code? Are you using textFile or sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao <hao@finra.org> wrote:
When I try to read multiple bz2 files from S3, I get the following warning messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
try to read multiple bz2 files in s3
When I try to read multiple bz2 files from S3, I get the following warning messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
SPARK_WORKER_INSTANCES deprecated
Can I still use SPARK_WORKER_INSTANCES in conf/spark-env.sh? The following is what I got after setting this parameter and running spark-shell:

SPARK_WORKER_INSTANCES was detected (set to '32'). This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
RE: SPARK_WORKER_INSTANCES deprecated
If you look at the Spark documentation, the variable SPARK_WORKER_INSTANCES is still listed, but SPARK_EXECUTOR_INSTANCES is not: http://spark.apache.org/docs/1.5.2/spark-standalone.html

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Monday, February 01, 2016 5:45 PM
To: Lin, Hao
Cc: user
Subject: Re: SPARK_WORKER_INSTANCES deprecated

As the message (from SparkConf.scala) showed, you shouldn't use SPARK_WORKER_INSTANCES any more. FYI

On Mon, Feb 1, 2016 at 2:19 PM, Lin, Hao <hao@finra.org> wrote:
Can I still use SPARK_WORKER_INSTANCES in conf/spark-env.sh? The following is what I got after setting this parameter and running spark-shell:

SPARK_WORKER_INSTANCES was detected (set to '32'). This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
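For reference, a minimal sketch of the replacement the deprecation warning points at, using SparkConf in application code (with spark-shell the same setting is normally passed on the command line, e.g. --num-executors 32 or --conf spark.executor.instances=32; the value 32 simply mirrors the warning above):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.instances replaces SPARK_WORKER_INSTANCES / SPARK_EXECUTOR_INSTANCES.
val conf = new SparkConf()
  .setAppName("executor-instances-example")
  .set("spark.executor.instances", "32")
val sc = new SparkContext(conf)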
how to access local file from Spark sc.textFile("file:///path to/myfile")
Hi, I have a problem accessing a local file, with this example:

sc.textFile("file:///root/2008.csv").count()

The error is: File file:/root/2008.csv does not exist. The file clearly exists, since if I mistype the file name to a non-existing one, it instead shows:

Error: Input path does not exist

Please help! The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271) at
RE: how to access local file from Spark sc.textFile("file:///path to/myfile")
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec 9 2014 /root/2008.csv

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
Hi, I have a problem accessing a local file, with this example:
sc.textFile("file:///root/2008.csv").count()
The error is: File file:/root/2008.csv does not exist. The file clearly exists, since if I mistype the file name to a non-existing one, it instead shows:
Error: Input path does not exist
Please help! The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
RE: how to access local file from Spark sc.textFile("file:///path to/myfile")
Yes to your question. I have spun up a cluster, logged in to the master as the root user, run spark-shell, and referenced the local file on the master machine.

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:50 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

One more question. Are you also running spark commands using root user ? Meanwhile am trying to simulate this locally.

On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
Here you go, thanks.
-rw-r--r-- 1 root root 658M Dec 9 2014 /root/2008.csv

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
Hi, I have a problem accessing a local file, with this example:
sc.textFile("file:///root/2008.csv").count()
The error is: File file:/root/2008.csv does not exist. The file clearly exists, since if I mistype the file name to a non-existing one, it instead shows:
Error: Input path does not exist
Please help! The following is the error message:

scala> sc.textFile("file:///root/2008.csv").count()
15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(H
RE: how to access local file from Spark sc.textFile("file:///path to/myfile")
I logged into the master of my cluster and referenced a local file on the master node machine. And yes, that file only resides on the master node, not on any of the remote workers.

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Friday, December 11, 2015 1:00 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

Hm, are you referencing a local file from your remote workers? That won't work as the file only exists in one machine (I presume).

On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao <hao@finra.org> wrote:
> Hi,
>
> I have a problem accessing a local file, with this example:
>
> sc.textFile("file:///root/2008.csv").count()
>
> The error is: File file:/root/2008.csv does not exist.
> The file clearly exists, since if I mistype the file name to a non-existing one, it instead shows:
>
> Error: Input path does not exist
>
> Please help!
>
> The following is the error message:
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498, 10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordR
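A small sketch of one way around this, assuming the file lives only on the master and is small enough to read on the driver (for a 658 MB file like the one above it would be safer to copy it to the same path on every worker, or to push it to HDFS or S3 instead):

import scala.io.Source

// Read the file on the driver (where it actually exists) and distribute the lines,
// so executors never try to open /root/2008.csv on their own local disks.
val localLines = Source.fromFile("/root/2008.csv").getLines().toSeq
val rdd = sc.parallelize(localLines)
rdd.count()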
RE: Graph visualization tool for GraphX
Hi Andy, a quick question: does Spark Notebook include its own Spark engine, or do I need to install Spark separately and point Spark Notebook at it? Thanks

From: Lin, Hao [mailto:hao@finra.org]
Sent: Tuesday, December 08, 2015 7:01 PM
To: andy petrella; Jörn Franke
Cc: user@spark.apache.org
Subject: RE: Graph visualization tool for GraphX

Thanks Andy, I will certainly give your suggestion a try.

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, December 08, 2015 1:21 PM
To: Lin, Hao; Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

Hello Lin,
This is indeed a tough scenario when you have many vertices and (even worse) many edges... So a two-fold answer:
First, technically, there is graph plotting support in the Spark Notebook (https://github.com/andypetrella/spark-notebook/) → check this notebook: https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb. You can plot a graph from Scala, which is converted to D3 with a force-layout force field. The points you plot are "sampled" using a `Sampler` that you can provide yourself. Which leads to the second fold of this answer.
Plotting a large graph is rather tough because there is no real notion of dimension... there is always the option to dig into topological analysis theory to find a good homeomorphism... but that won't be very efficient ;-D. Best is to find a good approach to generalize/summarize the information; there are many, many techniques (found mainly in geospatial viz and biology viz theory). Best is to check what matches your need the fastest. There are quick techniques like using unsupervised clustering models and then plotting a Voronoi diagram (which can be approximated using force layout). In general terms I might say that multiscaling is intuitively what you want first; this is an interesting paper presenting the foundations: https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf
Oh and BTW, to end this longish mail, while looking for new papers on this I came across this one: http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf which uses 1. Spark !!! 2. a tile-based approach (~ tiling + pyramids in geospatial)
HTH
PS regarding the Spark Notebook, you can always come and discuss on gitter: https://gitter.im/andypetrella/spark-notebook

On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao <hao@finra.org> wrote:
Hello Jörn, thank you for the reply and for being tolerant of my over-simplified question. I should have been more specific. Though it is ~TB of data, there will be about billions of records (edges) and 100,000 nodes. We need to visualize the social network graph, like what can be done with Gephi, which has scalability limitations for this amount of data. There will be dozens of users accessing it, and response time is also critical. We would like to run the visualization tool on a remote EC2 server, where a web tool would be a good choice for us. Please let me know if I need to be more specific ☺. Thanks, Hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret ma
Graph visualization tool for GraphX
Hi, can anyone recommend a good graph visualization tool for GraphX that can handle truly large data (~ TB)? Thanks so much, Hao
RE: Graph visualization tool for GraphX
Hello Jörn, thank you for the reply and for being tolerant of my over-simplified question. I should have been more specific. Though it is ~TB of data, there will be about billions of records (edges) and 100,000 nodes. We need to visualize the social network graph, like what can be done with Gephi, which has scalability limitations for this amount of data. There will be dozens of users accessing it, and response time is also critical. We would like to run the visualization tool on a remote EC2 server, where a web tool would be a good choice for us. Please let me know if I need to be more specific ☺. Thanks, Hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes of data in one large visualization?? You have to be more specific: what part of the data needs to be visualized, what kind of visualization, what navigation do you expect within the visualisation, how many users, response time, web tool vs mobile vs desktop, etc.

On 08 Dec 2015, at 16:46, Lin, Hao <hao@finra.org> wrote:
Hi, can anyone recommend a good graph visualization tool for GraphX that can handle truly large data (~ TB)? Thanks so much, Hao
RE: Graph visualization tool for GraphX
Thanks Andy, I will certainly give your suggestion a try.

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, December 08, 2015 1:21 PM
To: Lin, Hao; Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

Hello Lin,
This is indeed a tough scenario when you have many vertices and (even worse) many edges... So a two-fold answer:
First, technically, there is graph plotting support in the Spark Notebook (https://github.com/andypetrella/spark-notebook/) → check this notebook: https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb. You can plot a graph from Scala, which is converted to D3 with a force-layout force field. The points you plot are "sampled" using a `Sampler` that you can provide yourself. Which leads to the second fold of this answer.
Plotting a large graph is rather tough because there is no real notion of dimension... there is always the option to dig into topological analysis theory to find a good homeomorphism... but that won't be very efficient ;-D. Best is to find a good approach to generalize/summarize the information; there are many, many techniques (found mainly in geospatial viz and biology viz theory). Best is to check what matches your need the fastest. There are quick techniques like using unsupervised clustering models and then plotting a Voronoi diagram (which can be approximated using force layout). In general terms I might say that multiscaling is intuitively what you want first; this is an interesting paper presenting the foundations: https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf
Oh and BTW, to end this longish mail, while looking for new papers on this I came across this one: http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf which uses 1. Spark !!! 2. a tile-based approach (~ tiling + pyramids in geospatial)
HTH
PS regarding the Spark Notebook, you can always come and discuss on gitter: https://gitter.im/andypetrella/spark-notebook

On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao <hao@finra.org> wrote:
Hello Jörn, thank you for the reply and for being tolerant of my over-simplified question. I should have been more specific. Though it is ~TB of data, there will be about billions of records (edges) and 100,000 nodes. We need to visualize the social network graph, like what can be done with Gephi, which has scalability limitations for this amount of data. There will be dozens of users accessing it, and response time is also critical. We would like to run the visualization tool on a remote EC2 server, where a web tool would be a good choice for us. Please let me know if I need to be more specific ☺. Thanks, Hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes of data in one large visualization?? You have to be more specific, what part of the data needs to be visualized, what kind of visualization, what navigation do you expect within the visualisation, how many users, response time, web tool vs mobile vs Desktop etc

On 08 Dec 2015, at 16:46, Lin, Hao <hao@finra.org> wrote:
Hi,
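Following the sampling/summarizing idea above, a minimal sketch of cutting a GraphX graph down to something a front end such as Gephi or a D3 force layout can render; the toy data and the degree threshold are placeholders, not part of the thread:

import org.apache.spark.graphx._

// Toy graph: vertex attribute is a name, edge attribute is a weight.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(1L, 3L, 1.0), Edge(1L, 4L, 1.0), Edge(2L, 3L, 1.0)))
val graph = Graph(vertices, edges)

// Attach each vertex's degree, keep only vertices above a placeholder threshold;
// subgraph drops every edge that loses an endpoint, leaving something small enough to plot.
val withDegree = graph.outerJoinVertices(graph.degrees) { (_, name, deg) => (name, deg.getOrElse(0)) }
val hubs = withDegree.subgraph(vpred = (_, v) => v._2 >= 2)
hubs.vertices.collect()   // export this reduced graph to the visualization tool of choice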
Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?
Hi, does anyone know whether Spark running in AWS supports temporary access credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3? I only see references to setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, without any mention of a security token. Apparently these are only for static credentials. Many thanks
RE: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?
Thanks, I will keep an eye on it.

From: Michal Klos [mailto:michal.klo...@gmail.com]
Sent: Friday, December 04, 2015 1:50 PM
To: Lin, Hao
Cc: user
Subject: Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) supported by Spark?

We were looking into this as well; the answer looks like "no". Here's the ticket: https://issues.apache.org/jira/browse/HADOOP-9680
m

On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao <hao@finra.org> wrote:
Hi, does anyone know whether Spark running in AWS supports temporary access credentials (AccessKeyId, SecretAccessKey + SecurityToken) for accessing S3? I only see references to setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, without any mention of a security token. Apparently these are only for static credentials. Many thanks
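For reference, later versions of the Hadoop s3a connector did add session-token support. A minimal sketch of what that configuration looks like, assuming a Hadoop version that ships org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider (which the Hadoop releases discussed in this thread do not); the credential values are hypothetical placeholders for whatever STS issues:

sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", "ASIA...placeholder")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "placeholder-secret")
sc.hadoopConfiguration.set("fs.s3a.session.token", "placeholder-session-token")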
RE: starting spark-shell throws /tmp/hive on HDFS should be writable error
Mich, did you run this locally or on EC2 (I use EC2)? Is this problem universal or specific to, say, EC2? Many thanks

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:01 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

Hi, Actually I went back to 1.3, 1.3.1 and 1.4 and built Spark from source code with no luck, so I am not sure any good result will come from backtracking.
Cheers,
Mich Talebzadeh
http://talebzadehmich.wordpress.com

From: Lin, Hao [mailto:hao@finra.org]
Sent: 02 December 2015 21:49
To: Mich Talebzadeh <m...@peridale.co.uk>; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

I also have the same problem on my side using version 1.5.0. Just wondering if anyone has any update on this. Should I go back to an earlier version, like 1.4.0? thanks

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Friday, November 20, 2015 5:43 PM
To: user@spark.apache.org
Subject: FW: starting spark-shell throws /tmp/hive on HDFS should be writable error

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 20 November 2015 21:14
To: u...@hive.apache.org
Subject: starting spark-shell throws /tmp/hive on HDFS should be writable error

Hi, Has this been resolved? I don't think this has anything to do with the /tmp/hive directory permissions.

spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to Spark version 1.5.2
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--
:10: error: not found: value sqlContext
import sqlContext.implicits._
:10: error: not found: value sqlContext
import sqlContext.sql
scala>

Thanks,
Mich Talebzadeh
http://talebzadehmich.wordpress.com
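The exception above complains about the scratch directory permissions. Below is a hedged sketch of creating /tmp/hive and loosening its permissions through the Hadoop FileSystem API; this is not a fix confirmed anywhere in this thread, and it assumes the default filesystem is the one spark-shell is checking:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Create the Hive scratch directory if missing and make it world-writable.
// The path and intent come from the error message; the octal mode is an assumption.
val fs = FileSystem.get(new Configuration())
val scratch = new Path("/tmp/hive")
if (!fs.exists(scratch)) fs.mkdirs(scratch)
fs.setPermission(scratch, new FsPermission(Integer.parseInt("777", 8).toShort))

The equivalent shell command would be hadoop fs -chmod 777 /tmp/hive, or a plain chmod when running against the local filesystem.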
RE: starting spark-shell throws /tmp/hive on HDFS should be writable error
I actually don't have the folder /tmp/hive created on my master node. Is that a problem?

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:40 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

Locally
thx
Mich Talebzadeh
http://talebzadehmich.wordpress.com

From: Lin, Hao [mailto:hao@finra.org]
Sent: 02 December 2015 22:24
To: Mich Talebzadeh <m...@peridale.co.uk>; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable error

Mich, did you run this locally or on EC2 (I use EC2)? Is this problem universal or specific to, say, EC2? Many thanks
Re: The Processing loading of Spark streaming on YARN is not in balance
It seems that the data size is only 2.9 MB, far less than the default RDD size. How about putting more data into Kafka? And what about the number of topic partitions on the Kafka side?
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com

From: Kyle Lin kylelin2...@gmail.com
To: user@spark.apache.org
Date: 2015/04/30 14:39
Subject: The Processing loading of Spark streaming on YARN is not in balance

Hi all
My environment info:
Hadoop release version: HDP 2.1
Kafka: 0.8.1.2.1.4.0
Spark: 1.1.0

My question: I ran a Spark Streaming program on YARN. It reads data from Kafka and does some processing. But I found there is always only ONE executor doing the processing. As the following table shows, I have 10 worker executors, but only executor No. 5 is running now, and all RDD blocks are on executor No. 5. For my situation, the processing does not look distributed. How can I make all executors run together?

Executor ID | Address       | RDD Blocks | Memory Used       | Disk Used | Active Tasks | Failed Tasks | Complete Tasks | Total Tasks | Task Time | Input  | Shuffle Read | Shuffle Write
1           | slave03:38662 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 58             | 58          | 24.9 s    | 0.0 B  | 1197.0 B     | 440.5 KB
10          | slave05:36992 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 44             | 44          | 30.9 s    | 0.0 B  | 0.0 B        | 501.7 KB
2           | slave02:40250 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 30             | 30          | 19.6 s    | 0.0 B  | 855.0 B      | 1026.0 B
3           | slave01:40882 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 28             | 28          | 20.7 s    | 0.0 B  | 1197.0 B     | 1026.0 B
4           | slave04:57068 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 29             | 29          | 20.9 s    | 0.0 B  | 1368.0 B     | 1026.0 B
5           | slave05:40191 | 23         | 2.9 MB / 265.4 MB | 0.0 B     | 1            | 0            | 3928           | 3929        | 5.1 m     | 2.7 MB | 261.7 KB     | 564.4 KB
6           | slave03:35515 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 1            | 0            | 47             | 48          | 23.1 s    | 0.0 B  | 513.0 B      | 400.4 KB
7           | slave02:40325 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 30             | 30          | 20.2 s    | 0.0 B  | 855.0 B      | 1197.0 B
8           | slave01:48609 | 0          | 0.0 B / 265.4 MB  | 0.0 B     | 0            | 0            | 28             | 28          | 20.5 s    | 0.0 B  | 1363.0 B     | 1026.0 B
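Besides the small data volume, a single Kafka receiver keeps all received blocks on one executor. Below is a rough sketch of two common workarounds, several unioned receivers and a repartition before the heavy work; the batch interval, topic, group and ZooKeeper address are placeholders, none of them from this thread:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Several receivers, unioned into one stream (placeholders throughout).
val ssc = new StreamingContext(sc, Seconds(10))
val numReceivers = 4
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", Map("demo-topic" -> 1))
}
val unioned = ssc.union(streams)
// Repartition so the processing runs on more executors than just the receivers.
val balanced = unioned.map(_._2).repartition(8)
balanced.count().print()
ssc.start()
ssc.awaitTermination()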
Re: Re: implicit function in SparkStreaming
For your question, I think the discussion in this link can help: http://apache-spark-user-list.1001560.n3.nabble.com/Error-related-to-serialisation-in-spark-streaming-td6801.html
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com

From: guoqing0...@yahoo.com.hk
To: Tathagata Das t...@databricks.com
Cc: user user@spark.apache.org
Date: 2015/04/30 11:17
Subject: Re: Re: implicit function in SparkStreaming

Appreciate your help, it works. I'm curious why the enclosing class cannot be serialized. Does it need to extend java.io.Serializable? And if the object is never serialized, how does it work in the task? Is there any association with spark.closure.serializer?
guoqing0...@yahoo.com.hk

From: Tathagata Das
Date: 2015-04-30 09:30
To: guoqing0...@yahoo.com.hk
CC: user
Subject: Re: Re: implicit function in SparkStreaming

Could you put the implicit def in an object? That should work, as objects are never serialized.

On Wed, Apr 29, 2015 at 6:28 PM, guoqing0...@yahoo.com.hk wrote:
Thank you for your pointers, very helpful to me. In this scenario, how can I use the implicit def in the enclosing class?

From: Tathagata Das
Date: 2015-04-30 07:00
To: guoqing0...@yahoo.com.hk
CC: user
Subject: Re: implicit function in SparkStreaming

I believe the implicit def is pulling the enclosing class (in which the def is defined) into the closure, and that class is not serializable.

On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk wrote:
Hi guys, I'm puzzled why I can't use an implicit function in Spark Streaming; it causes "Task not serializable". Code snippet:

implicit final def str2KeyValue(s: String): (String, String) = {
  val message = s.split("\\|")
  if (message.length >= 2) (message(0), message(1))
  else if (message.length == 1) (message(0), "")
  else ("", "")
}

def filter(stream: DStream[String]): DStream[String] = {
  stream.filter(s => s._1 == "Action" && s._2 == "TRUE")
}

Could you please give me some pointers? Thank you.
guoqing0...@yahoo.com.hk
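A sketch of the object-based arrangement suggested in the reply above, rebuilt from the original snippet (the object name and the DStream import are assumptions):

import scala.language.implicitConversions
import org.apache.spark.streaming.dstream.DStream

// Keeping the implicit def in a standalone object means the filter closure
// no longer captures the (non-serializable) enclosing class.
object MessageConversions {
  implicit def str2KeyValue(s: String): (String, String) = {
    val message = s.split("\\|")
    if (message.length >= 2) (message(0), message(1))
    else if (message.length == 1) (message(0), "")
    else ("", "")
  }
}

def filter(stream: DStream[String]): DStream[String] = {
  import MessageConversions._
  // The implicit conversion turns each String into a (key, value) pair here.
  stream.filter(s => s._1 == "Action" && s._2 == "TRUE")
}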
Re: A problem of using spark streaming to capture network packets
1. The full command line is written in a shell script:

LIB=/home/spark/.m2/repository
/opt/spark/bin/spark-submit \
  --class spark.pcap.run.TestPcapSpark \
  --jars $LIB/org/pcap4j/pcap4j-core/1.4.0/pcap4j-core-1.4.0.jar,$LIB/org/pcap4j/pcap4j-packetfactory-static/1.4.0/pcap4j-packetfactory-static-1.4.0.jar,$LIB/org/slf4j/slf4j-api/1.7.6/slf4j-api-1.7.6.jar,$LIB/org/slf4j/slf4j-log4j12/1.7.6/slf4j-log4j12-1.7.6.jar,$LIB/net/java/dev/jna/jna/4.1.0/jna-4.1.0.jar \
  /home/spark/napa/napa.jar

2. And we run this script with sudo; if you do not use sudo, you cannot access the network interface.
3. We also tested List<PcapNetworkInterface> nifs = Pcaps.findAllDevs() in a standard Java program, and it worked like a champion.
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com

From: Dean Wampler deanwamp...@gmail.com
To: Hai Shan Wu/China/IBM@IBMCN
Cc: user user@spark.apache.org, Lin Hao Xu/China/IBM@IBMCN
Date: 2015/04/28 20:07
Subject: Re: A problem of using spark streaming to capture network packets

It's probably not your code. What's the full command line you use to submit the job? Are you sure the job on the cluster has access to the network interface? Can you test the receiver by itself without Spark? For example, does this line work as expected:
List<PcapNetworkInterface> nifs = Pcaps.findAllDevs();
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Mon, Apr 27, 2015 at 4:03 AM, Hai Shan Wu wuh...@cn.ibm.com wrote:
Hi Everyone
We use pcap4j to capture network packets and then use Spark Streaming to analyze the captured packets. However, we met a strange problem. If we run our application on Spark locally (for example, spark-submit --master local[2]), the program runs successfully. If we run it on a Spark standalone cluster, the program tells us that NO NIFs were found. I also attach two test files for clarification. Can anyone help with this? Thanks in advance!
(See attached file: PcapReceiver.java) (See attached file: TestPcapSpark.java)
Best regards,
Haishan
Haishan Wu (吴海珊)
IBM Research - China
Tel: 86-10-58748508
Fax: 86-10-58748330
Email: wuh...@cn.ibm.com
Lotus Notes: Hai Shan Wu/China/IBM
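One way to probe Dean's question from inside Spark, rather than from a standalone Java program, is to call findAllDevs() in a task on each worker and collect what the executor JVMs see. This is only a debugging sketch, not something from the thread; the partition count and output format are arbitrary:

import scala.collection.JavaConverters._
import org.pcap4j.core.Pcaps

// Run findAllDevs() inside tasks so the result reflects the executor JVMs,
// not the driver or a standalone test program.
val devsPerTask = sc.parallelize(1 to 4, 4).map { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val nifs = Pcaps.findAllDevs().asScala.map(_.getName).mkString(",")
  s"$host -> ${if (nifs.isEmpty) "NO NIFs" else nifs}"
}.collect()
devsPerTask.foreach(println)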
Re: A problem of using spark streaming to capture network packets
By the way, from the Spark web UI, the acl is marked with root.
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com

From: Dean Wampler deanwamp...@gmail.com
To: Lin Hao Xu/China/IBM@IBMCN
Cc: Hai Shan Wu/China/IBM@IBMCN, user user@spark.apache.org
Date: 2015/04/29 09:40
Subject: Re: A problem of using spark streaming to capture network packets

Are the tasks on the slaves also running as root? If not, that might explain the problem.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com
Re: A problem of using spark streaming to capture network packets
Actually, to simplify the problem, we run our program on a single machine with 4 slave workers. Since it is a single machine, I think all slave workers run with root privileges. BTW, if we have a cluster, how do we make sure the slaves on remote machines run the program as root?
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com

From: Dean Wampler deanwamp...@gmail.com
To: Lin Hao Xu/China/IBM@IBMCN
Cc: Hai Shan Wu/China/IBM@IBMCN, user user@spark.apache.org
Date: 2015/04/29 09:40
Subject: Re: A problem of using spark streaming to capture network packets

Are the tasks on the slaves also running as root? If not, that might explain the problem.
dean