Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben, You can replace HDFS with a number of storage systems since Spark is compatible with other storage like S3. This would allow you to scale your compute nodes solely for the purpose of adding compute power and not disk space. You can deploy Alluxio on your compute nodes to offset the
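
For illustration, a minimal sketch of pointing Spark at S3 instead of HDFS, as the reply suggests. The bucket, path, and credential values are hypothetical, and it assumes the s3a connector and AWS SDK jars are on the classpath:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3Input {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("s3-input-example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Credentials for the s3a connector; in practice these often come from
    // the environment or an instance profile rather than being set in code.
    sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
    sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

    // Read directly from S3 instead of hdfs://, so compute nodes carry no HDFS disks.
    JavaRDD<String> lines = sc.textFile("s3a://my-bucket/input/*.txt");
    System.out.println("line count: " + lines.count());

    sc.stop();
  }
}
```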

Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi, If you are concerned about the performance of the alternative filesystems (i.e. needing a caching client), you can use Alluxio on top of any of NFS, Ceph

Re: About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Calvin Jia
Hi, Alluxio will allow you to share or cache data in memory between different Spark contexts by storing RDDs or DataFrames as files in the Alluxio system. The files can then be accessed by any Spark job like a file in any other distributed storage system. These two blogs do a good job of
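
A hedged sketch of the pattern described above, with one job writing a DataFrame to Alluxio and a later job (possibly a different SparkContext) reading it back. The master address and path are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShareViaAlluxio {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("share-via-alluxio").getOrCreate();

    // Job A: persist a DataFrame as a file in Alluxio.
    Dataset<Row> df = spark.read().json("hdfs:///input/events.json");
    df.write().parquet("alluxio://alluxio-master:19998/shared/events.parquet");

    // Job B (a different Spark application/context) reads it back like any
    // other distributed-storage file, served from Alluxio memory.
    Dataset<Row> shared = spark.read().parquet("alluxio://alluxio-master:19998/shared/events.parquet");
    shared.show();

    spark.stop();
  }
}
```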

Re: sandboxing Spark executors

2016-11-04 Thread Calvin Jia
Hi, If you are using the latest Alluxio release (1.3.0), authorization is enabled, preventing users from accessing data they do not have permissions to. For older versions, you will need to enable the security flag. The documentation on security

Re: feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-19 Thread Calvin Jia
Hi, Alluxio allows for data sharing between applications through a File System API (Native Java Alluxio client, Hadoop FileSystem, or POSIX through fuse). If your MPI applications can use any of these interfaces, you should be able to use Alluxio for data sharing out of the box. In terms of
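
As a rough illustration of the Hadoop FileSystem route mentioned above, a minimal sketch of a non-Spark Java process reading a shared file from Alluxio. Host, port, and path are hypothetical; it assumes the Alluxio client jar is on the classpath and the alluxio:// scheme is registered (e.g. fs.alluxio.impl mapped to the Alluxio Hadoop client):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromAlluxio {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumption: the alluxio:// scheme is mapped to the Alluxio Hadoop client.
    FileSystem fs = FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), conf);

    try (FSDataInputStream in = fs.open(new Path("/shared/data.bin"))) {
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes from Alluxio");
    }
  }
}
```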

Re: how to use spark.mesos.constraints

2016-07-26 Thread Jia Yu
Hi, I am also trying to use spark.mesos.constraints, but it gives me the same error: the job has not been accepted by any resources. I suspect that I need to start some additional service like ./sbin/start-mesos-shuffle-service.sh. Am I correct? Thanks, Jia On Tue, Dec 1, 2015 at 5:14 PM
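
For reference, spark.mesos.constraints is a plain configuration property rather than something enabled by an extra service; a hedged sketch of setting it (the attribute names and values below are made up and must match your Mesos agent attributes):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MesosConstraintsExample {
  public static void main(String[] args) {
    // Accept only Mesos offers whose agent attributes satisfy these
    // semicolon-separated attribute:value constraints.
    SparkConf conf = new SparkConf()
        .setAppName("mesos-constraints-example")
        .set("spark.mesos.constraints", "os:centos7;rack:rack-1");

    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... job body ...
    sc.stop();
  }
}
```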

Re: JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
Hi Sean, Thanks for your great help! It works correctly if I remove persist! As a next step, I will transform those values before persisting. I convert to RDD and back to JavaRDD just for testing purposes. Best Regards, Jia On Mon, Jul 25, 2016 at 1:01 PM, Sean Owen <so...@cloudera.com>

JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
My code is as follows: System.out.println("Initialize points..."); JavaPairRDD<IntWritable, DoubleArrayWritable> data = sc.sequenceFile(inputFile, IntWritable.class, DoubleArrayWritable.class);
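
A likely explanation, consistent with the follow-up where removing persist fixed the problem, is that Hadoop record readers reuse the same Writable instances, so caching the raw sequence-file RDD leaves every cached element pointing at the last record read. A hedged sketch of the usual workaround, copying each Writable into a fresh object before caching (DoubleArrayWritable is the user class from the original code and is assumed to implement Writable with a no-arg constructor):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableUtils;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// ... continuing from the `data` RDD created above ...
JavaPairRDD<IntWritable, DoubleArrayWritable> copied = data.mapToPair(pair -> {
    // Build the Configuration on the executor side; Hadoop's Configuration is
    // not java.io.Serializable, so it cannot be captured from the driver.
    Configuration conf = new Configuration();
    // Clone the reused Writables so cached records are independent objects.
    return new Tuple2<IntWritable, DoubleArrayWritable>(
        WritableUtils.clone(pair._1(), conf),
        WritableUtils.clone(pair._2(), conf));
});

// Now caching is safe: each element holds its own copy of the data.
copied.cache();
```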

Spark reduce serialization question

2016-03-04 Thread James Jia
I'm running a distributed KMeans algorithm with 4 executors. I have an RDD[Data]. I use mapPartitions to run a learner on each data partition, and then call reduce with my custom model reduce function to combine the partial models and start a new iteration. The model size is around 330 MB. I
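
One commonly suggested mitigation, sketched here under stated assumptions: a flat reduce ships every ~330 MB partial model back toward the driver, whereas treeReduce merges them in stages on the executors. Model, Learner, and merge are placeholders standing in for the custom classes described in the post:

```java
import org.apache.spark.api.java.JavaRDD;

// partialModels: one learner output per partition, as in the original description.
// JavaRDD<Model> partialModels = dataRdd.mapPartitions(...);

// treeReduce combines partial models in log-depth rounds on the executors,
// so the driver receives only the final merged model instead of one per partition.
Model merged = partialModels.treeReduce((a, b) -> a.merge(b), /* depth = */ 2);
```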

Re: how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Jia Zou
driver (also depends on how your workload interacts with the driver) and underlying storage. In my opinion, it may be difficult to derive one generic and easy formula to describe all the dynamic relationships. Best Regards, Jia On Wed, Feb 3, 2016 at 12:13 AM, Divya Gehlot <divya.htco...@gmail.

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-02-01 Thread Jia Zou
configuration I've changed is the Tachyon data block size. The above experiment is part of a research project. Best Regards, Jia On Thursday, January 28, 2016 at 9:11:19 PM UTC-6, Calvin Jia wrote: > > Hi, > > Thanks for the detailed information. How large is the dataset you are > running ag

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-28 Thread Calvin Jia
Hi, Thanks for the detailed information. How large is the dataset you are running against? Also, did you change any Tachyon configurations? Thanks, Calvin

[Problem Solved] Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi, dears, the problem has been solved. I mistakenly used tachyon.user.block.size.bytes instead of tachyon.user.block.size.bytes.default. It works now. Sorry for the confusion and thanks again to Gene! Best Regards, Jia On Wed, Jan 27, 2016 at 4:59 AM, Jia Zou <jacqueline...@gmail.com>
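
One hedged way to pass the corrected property to the Tachyon client that Spark uses is as a JVM system property; the exact plumbing can differ across Tachyon versions, so treat this as a sketch rather than the canonical fix. The 128 MB value mirrors the thread:

```java
import org.apache.spark.SparkConf;

// Forward the Tachyon client block size to executor JVMs as a -D flag.
// For the driver JVM, the same flag normally has to be supplied at launch
// (spark-defaults.conf or --driver-java-options), since the driver is already
// running by the time this SparkConf is constructed.
SparkConf conf = new SparkConf()
    .setAppName("tachyon-block-size-example")
    .set("spark.executor.extraJavaOptions",
         "-Dtachyon.user.block.size.bytes.default=134217728");
```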

TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
Dears, I keep getting below exception when using Spark 1.6.0 on top of Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH. Any suggestions will be appreciated, thanks! = Exception in thread "main" org.apache.spark.SparkException: Job aborted

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ... 15 more On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote: > Dears, I keep getting below exception when using Sp

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) On Wed, Jan 27, 2016 at 5:53 AM, Jia Zou <jacqueline...@gmail.com> wrote: > BTW. The tachyon worker log says

Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi, Gene, Thanks for your suggestion. However, even though I set tachyon.user.block.size.bytes=134217728, and I can see that from the web console, the files that I load into Tachyon via copyToLocal still have a 512 MB block size. Do you have more suggestions? Best Regards, Jia On Tue, Jan 26, 2016 at 11

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
-10-73-198-35:7077 /home/ubuntu/HiBench/src/sparkbench/target/sparkbench-5.0-SNAPSHOT-MR2-spark1.5-jar-with-dependencies.jar tachyon://localhost:19998/Kmeans/Input/samples 10 5 On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote: > Dears, I keep getting below excep

Fwd: Spark partition size tuning

2016-01-25 Thread Jia Zou
thod can't work for Tachyon data. Do you have any suggestions? Thanks very much! Best Regards, Jia ---------- Forwarded message ---------- From: Jia Zou <jacqueline...@gmail.com> Date: Thu, Jan 21, 2016 at 10:05 PM Subject: Spark partition size tuning To: "user @spark" <user

Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
that Spark doesn't know the data is in the HDFS cache, and still reads data from disk instead of from the HDFS cache? Thanks! Jia

Spark partition size tuning

2016-01-21 Thread Jia Zou
Dear all! When using Spark to read from the local file system, the default partition size is 32 MB. How can I increase the partition size to 128 MB to reduce the number of tasks? Thank you very much! Best Regards, Jia
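
A hedged sketch of two common knobs for this: raising the block size the local filesystem reports (which is where the 32 MB default split size comes from), or coalescing after reading. The input path is hypothetical:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalPartitionSize {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partition-size"));

    // Option 1 (assumption): raise the local-filesystem block size so each
    // input split, and hence each task, covers 128 MB instead of 32 MB.
    sc.hadoopConfiguration().setLong("fs.local.block.size", 128L * 1024 * 1024);
    JavaRDD<String> bigSplits = sc.textFile("file:///data/input.txt");

    // Option 2: read with the default splits, then coalesce to fewer partitions
    // (no shuffle; each task simply processes several of the original splits).
    JavaRDD<String> coalesced = sc.textFile("file:///data/input.txt").coalesce(8);

    System.out.println(bigSplits.partitions().size() + " / " + coalesced.partitions().size());
    sc.stop();
  }
}
```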

Re: spark 1.6.0 on ec2 doesn't work

2016-01-19 Thread Calvin Jia
Hi Oleg, The Tachyon related issue should be fixed. Hope this helps, Calvin On Mon, Jan 18, 2016 at 2:51 AM, Oleg Ruchovets wrote: > Hi , >I tried to follow the Spark 1.6.0 instructions to install Spark on EC2. > > It doesn't work properly - got exceptions and at the end

Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread Jia
Hi, Praveen, have you checked out this, which might have the details you need: https://spark-summit.org/2014/wp-content/uploads/2014/07/Spark-Job-Server-Easy-Spark-Job-Management-Chan-Chu.pdf Best Regards, Jia On Jan 19, 2016, at 7:28 AM, praveen S <mylogi...@gmail.com> wrote: > Can

Can I configure Spark on multiple nodes using local filesystem on each node?

2016-01-19 Thread Jia Zou
Dear all, Can I configure Spark on multiple nodes without HDFS, so that output data will be written to the local file system on each node? I guess there is no such feature in Spark, but just want to confirm. Best Regards, Jia
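
This does work with file:// paths, with the caveat that each executor writes only the part-files for its own partitions to its own local disk, so the result is scattered across nodes rather than collected in one place. A small hedged sketch with hypothetical paths (the input path must exist at the same location on every node):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalFsOutput {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("local-fs-output"));

    // Input must be present at this path on every node (or be a shared mount).
    JavaRDD<String> input = sc.textFile("file:///data/shared/input.txt");

    // Each executor writes its own partitions' part-files under this directory
    // on its own local disk; no HDFS is involved.
    input.map(String::toUpperCase)
         .saveAsTextFile("file:///data/output/run1");

    sc.stop();
  }
}
```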

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
will be killed. But I hope that all submitted applications can run in the same executor; can JobServer do that? If so, it’s really good news! Best Regards, Jia On Jan 17, 2016, at 3:09 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > You've still got me confused. The SparkConte

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem, so that jobs can be submitted at different times and still share RDDs. Best Regards, Jia On Jan 17, 2016, at 3:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > There is a 1-to-1 relationship betw

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Hi, Mark, sorry, I meant SparkContext. I want to change Spark so that all submitted jobs (SparkContexts) run in one executor JVM. Best Regards, Jia On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > -dev > > What do you mean by JobContext? T

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
er/shared-SparkContext. > Otherwise, in order to share the data in an RDD you have to use an external > storage system, such as a distributed filesystem or Tachyon. > > On Sun, Jan 17, 2016 at 1:52 PM, Jia <jacqueline...@gmail.com> wrote: > Thanks, Mark. Then, I guess JobServer can

Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Dear all, Is there a way to reuse executor JVM across different JobContexts? Thanks. Best Regards, Jia

org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Jia Zou
scheduler.TaskSetManager: Lost task 197.3 in stage 0.0 (TID 210) on executor 10.149.11.81: java.lang.RuntimeException (org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found Can any one give me some suggestions? Thanks a lot! Best Regards, Jia

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-31 Thread Jia Zou
k you very much! Best Regards, Jia On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang <yblia...@gmail.com> wrote: > Hi Jia, > > You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it > can produce stable performance. The storage level of MEMORY_AND_DISK will >
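
For reference, the suggestion from the quoted reply looks like this in code; a minimal sketch in which inputRDD stands in for the parsed KMeans input from the original post:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
// instead of dropping and recomputing them, which tends to give more stable
// run times once the input grows past the available executor memory.
// JavaRDD<double[]> inputRDD = ...;   // element type assumed for illustration
inputRDD.persist(StorageLevel.MEMORY_AND_DISK());
```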

Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Jia Zou
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU cores and 30GB memory. Executor memory is set to 15GB, and driver memory is set to 15GB. The observation is that, when input data size is smaller than 15GB, the performance is quite stable. However, when input data becomes

How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
ly profile the CPU usage of the org.apache.spark.deploy.SparkSubmit class, and can not provide insights for other classes like BlockManager, and user classes. Any suggestions? Thanks a lot! Best Regards, Jia

Re: How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
Hi, Ted, it works, thanks a lot for your help! --Jia On Sat, Dec 12, 2015 at 3:01 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you tried adding the option below through > spark.executor.extraJavaOptions ? > > Cheers > > > On Dec 13, 2015, at 3:36 AM, Jia Zou <
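
The exact option Ted suggested is truncated out of the snippet above; a typical HProf CPU-sampling setting applied to the executors (JDK 8-era HProf agent, hypothetical output path) would look roughly like this, so that executor-side classes such as BlockManager and user code are sampled rather than only SparkSubmit in the driver:

```java
import org.apache.spark.SparkConf;

// Sketch: attach the HProf agent to every executor JVM.
SparkConf conf = new SparkConf()
    .setAppName("hprof-example")
    .set("spark.executor.extraJavaOptions",
         "-agentlib:hprof=cpu=samples,interval=10,depth=10,file=/tmp/executor-hprof.txt");
```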

Re: Saving RDDs in Tachyon

2015-12-09 Thread Calvin Jia
Hi Mark, Were you able to successfully store the RDD with Akhil's method? When you read it back as an objectFile, you will also need to specify the correct type. You can find more information about integrating Spark and Tachyon on this page:
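
A hedged sketch of the round trip described above; the tachyon:// master address, path, and the Record class are hypothetical, and the stored element type must be serializable and must match when reading back:

```java
import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TachyonObjectFile {
  public static class Record implements Serializable {
    public int id;
    public Record(int id) { this.id = id; }
  }

  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("tachyon-objectfile"));

    JavaRDD<Record> rdd = sc.parallelize(Arrays.asList(new Record(1), new Record(2)));
    rdd.saveAsObjectFile("tachyon://tachyon-master:19998/rdds/records");

    // When reading back, declare the same element type that was written.
    JavaRDD<Record> restored = sc.objectFile("tachyon://tachyon-master:19998/rdds/records");
    System.out.println("restored " + restored.count() + " records");

    sc.stop();
  }
}
```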

Re: Re: Spark RDD cache persistence

2015-12-09 Thread Calvin Jia
Hi Deepak, For persistence across Spark jobs, you can store and access the RDDs in Tachyon. Tachyon works with ramdisk which would give you similar in-memory performance you would have within a Spark job. For more information, you can take a look at the docs on Tachyon-Spark integration:

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
why we want shared memory. Suggestions will be highly appreciated! Best Regards, Jia On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote: > -dev, +user (this is not a question about development of Spark itself so > you’ll get more answers in the user mailing list) &

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Annabel, but I may need to clarify that I have no intention to write and run Spark UDF in C++, I'm just wondering whether Spark can read and write data to a C++ process with zero copy. Best Regards, Jia On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com>

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
in Spark community that we can leverage to do this. Best Regards, Jia On Dec 7, 2015, at 11:56 AM, Robin East <robin.e...@xense.co.uk> wrote: > I guess you could write a custom RDD that can read data from a memory-mapped > file - not really my area of expertise so I’ll leave it to o

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Kazuaki, It’s very similar to my requirement, thanks! It seems they want to write to a C++ process with zero copy, and I want to do both reads and writes with zero copy. Does anyone know how to obtain more information, like the current status of this JIRA entry? Best Regards, Jia On Dec 7, 2015

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
more carefully, in case it has a very efficient C++ binding mechanism. Best Regards, Jia On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote: > Maybe looking into something like Tachyon would help, I see some sample c++ > bindings, not sure how much of the current f

Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Jia Zhan
pile-first) on project spark-core_2.10: Execution > scala-test-compile-first of goal > net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed. > CompileFailed -> [Help 1] > > > > -- > Regards, > Raghuveer Chanda > > -- Jia Zhan

Re: How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Calvin Jia
Hi Shane, Tachyon provides an api to get the block locations of the file which Spark uses when scheduling tasks. Hope this helps, Calvin On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane wrote: > Hi all, > > > > I am looking into how Spark handles data locality wrt

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
PM, Igor Berman <igor.ber...@gmail.com> wrote: > Do your iterations really submit a job? I don't see any action there > On Oct 17, 2015 00:03, "Jia Zhan" <zhanjia...@gmail.com> wrote: > >> Hi all, >> >> I am running Spark locally in one node and tryin

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
goy...@gmail.com> wrote: > Hi Jia, > > RDDs are cached on the executor, not on the driver. I am assuming you are > running locally and haven't changed spark.executor.memory? > > Sonal > On Oct 19, 2015 1:58 AM, "Jia Zhan" <zhanjia...@gmail.com> wrote: > >

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Jia Zhan
Does anyone have any clue what's going on? Why would caching with 2g of memory be much faster than with 15g of memory? Thanks very much! On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote: > Hi all, > > I am running Spark locally in one node and trying to sweep

In-memory computing and cache() in Spark

2015-10-16 Thread Jia Zhan
The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!! And UI shows 6% of the data is cached. *From the results we can see the reduce stages finish in seconds, how could that happen with only 6% cached? Can anyone explain?* I am new to Spark and would appreciate any help on this. Thanks! Jia

Re: TTL for saveAsObjectFile()

2015-10-14 Thread Calvin Jia
Hi Antonio, I don't think Spark provides a way to pass down params with saveAsObjectFile. One way could be to pass a default TTL in the configuration, but the approach doesn't make much sense since TTL is not necessarily uniform. Baidu will be talking about their use of TTL in Tachyon with Spark

Can we gracefully kill stragglers in Spark SQL

2015-09-04 Thread Jia Zhan
this to work? Is it possible to early terminate some tasks without affecting the overall execution of the job, with some cost of accuracy? Appreciate your help! -- Jia Zhan

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi, Tachyon http://tachyon-project.org manages memory off heap which can help prevent long GC pauses. Also, using Tachyon will allow the data to be shared between Spark jobs if they use the same dataset. Here's http://www.meetup.com/Tachyon/events/222485713/ a production use case where Baidu

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-16 Thread Jia Yu
Hi Peng, I got exactly the same error! My shuffle data is also very large. Have you figured out a way to solve it? Thanks, Jia On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng pc...@uow.edu.au wrote: I'm deploying a Spark data processing job on an EC2 cluster, the job is small for the cluster

Help!!!Map or join one large datasets then suddenly remote Akka client disassociated

2015-06-15 Thread Jia Yu
this problem. --- Any help will be appreciated!!! Thanks, Jia

Re: Spark Cluster Benchmarking Frameworks

2015-06-03 Thread Zhen Jia
Hi Jonathan, Maybe you can try BigDataBench: http://prof.ict.ac.cn/BigDataBench/ . It provides lots of workloads, including both Hadoop and Spark based workloads. Zhen Jia hodgesz wrote Hi Spark Experts, I am curious what people are using to benchmark

Re: Spark SQL 1.3.1 saveAsParquetFile will output tachyon file with different block size

2015-04-28 Thread Calvin Jia
Hi, You can apply this patch https://github.com/apache/spark/pull/5354 and recompile. Hope this helps, Calvin On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa eng.sara.must...@gmail.com wrote: Hi Zhang, How did you compile Spark 1.3.1 with Tachyon? when i changed Tachyon version to 0.6.3 in

Cannot change the memory of workers

2015-04-07 Thread Jia Yu
? Is there a requirement that one worker must maintain 1 gb memory for itself aside from the memory for Spark? Thanks, Jia

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-16 Thread Chang-Jia Wang
I just used random numbers. (My ML lib was spark-mllib_2.10-1.2.1.) Please see the attached log. In the middle of the log, I dumped the data set before feeding into LogisticRegressionWithLBFGS. The first column false/true was the label (attribute “a”), and columns 2-5 (attributes “x”, “y”, “z”, and