Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben,

You can replace HDFS with a number of storage systems since Spark is
compatible with other storage like S3. This would allow you to scale your
compute nodes solely for the purpose of adding compute power and not disk
space. You can deploy Alluxio on your compute nodes to offset the
performance impact of decoupling your compute and storage, as well as unify
multiple storage spaces if you would like to still use HDFS, S3, and/or
other storage solutions in tandem. Here is an article which describes a
similar architecture.
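
For example, a minimal sketch (bucket name, paths, and credential values are
placeholders) of pointing a Spark job at S3 instead of HDFS through the common
Hadoop FileSystem layer:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Assumes the hadoop-aws module (s3a) is on the classpath.
    SparkConf conf = new SparkConf().setAppName("s3-instead-of-hdfs");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Credentials can also come from IAM roles or environment variables.
    sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
    sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

    JavaRDD<String> input = sc.textFile("s3a://my-bucket/input/part-*");
    input.saveAsTextFile("s3a://my-bucket/output/run-1");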

Hope this helps,
Calvin

On Mon, Feb 13, 2017 at 12:46 AM, Saisai Shao 
wrote:

> IIUC Spark doesn't strongly bind to HDFS, it uses a common FileSystem
> layer which supports different FS implementations, HDFS is just one option.
> You could also use S3 as a backend FS, from Spark's point it is transparent
> to different FS implementations.
>
>
>
> On Sun, Feb 12, 2017 at 5:32 PM, ayan guha  wrote:
>
>> How about adding more NFS storage?
>>
>> On Sun, 12 Feb 2017 at 8:14 pm, Sean Owen  wrote:
>>
>>> Data has to live somewhere -- how do you not add storage but store more
>>> data?  Alluxio is not persistent storage, and S3 isn't on your premises.
>>>
>>> On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim  wrote:
>>>
>>> Has anyone got some advice on how to remove the reliance on HDFS for
>>> storing persistent data. We have an on-premise Spark cluster. It seems like
>>> a waste of resources to keep adding nodes because of a lack of storage
>>> space only. I would rather add more powerful nodes due to the lack of
>>> processing power at a less frequent rate, than add less powerful nodes at a
>>> more frequent rate just to handle the ever growing data. Can anyone point
>>> me in the right direction? Is Alluxio a good solution? S3? I would like to
>>> hear your thoughts.
>>>
>>> Cheers,
>>> Ben
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi,

If you are concerned about the performance of the alternative filesystems
(i.e. needing a caching client), you can use Alluxio on top of NFS, Ceph,
GlusterFS, or other/multiple storage systems. Especially since your working
sets will not be huge, you most likely will be able to store all the relevant
data within Alluxio during computation, giving you the flexibility to store
your data in your preferred storage without performance penalties.
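
As a rough sketch (master host, port, and paths are placeholders), once Alluxio
is mounted over the underlying store, Spark addresses it through the alluxio://
scheme just like any other Hadoop-compatible filesystem:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Assumes the Alluxio client jar is on Spark's classpath and an Alluxio
    // master is running at alluxio-master:19998 (placeholder address).
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("alluxio-backed-job"));
    JavaRDD<String> lines =
        sc.textFile("alluxio://alluxio-master:19998/data/input.txt");
    lines.saveAsTextFile("alluxio://alluxio-master:19998/data/output");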

Hope this helps,
Calvin

On Sun, Dec 18, 2016 at 11:23 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> I am using Gluster and I have decent performance with basic maintenance
> effort. Advantage of Gluster: you can plug Alluxio on top to improve perf,
> but I still need to validate that...
>
> On 18 Dec 2016 8:50 PM,  wrote:
>
>> Hello,
>>
>> We are trying out Spark for some file processing tasks.
>>
>> Since each Spark worker node needs to access the same files, we have
>> tried using Hdfs. This worked, but there were some oddities making me a
>> bit uneasy. For dependency hell reasons I compiled a modified Spark, and
>> this version exhibited the odd behaviour with Hdfs. The problem might
>> have nothing to do with Hdfs, but the situation made me curious about
>> the alternatives.
>>
>> Now I'm wondering what kind of file system would be suitable for our
>> deployment.
>>
>> - There won't be a great number of nodes. Maybe 10 or so.
>>
>> - The datasets won't be big by big-data standards(Maybe a couple of
>>   hundred gb)
>>
>> So maybe I could just use a NFS server, with a caching client?
>> Or should I try Ceph, or Glusterfs?
>>
>> Does anyone have any experiences to share?
>>
>> --
>> Joakim Verona
>> joa...@verona.se
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Calvin Jia
Hi,

Alluxio will allow you to share or cache data in-memory between different
Spark contexts by storing RDDs or Dataframes as a file in the Alluxio
system. The files can then be accessed by any Spark job like a file in any
other distributed storage system.

These two blogs do a good job of summarizing the end-to-end workflow of
using Alluxio to share RDDs or DataFrames between Spark jobs.
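
A minimal sketch of that workflow (placeholder master address and paths): one
job writes a DataFrame to Alluxio as Parquet, and a later job with its own
SparkContext reads it back.

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Job 1: persist the DataFrame as a file in Alluxio.
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame df =
        sqlContext.read().json("alluxio://alluxio-master:19998/data/events.json");
    df.write().parquet("alluxio://alluxio-master:19998/shared/events.parquet");

    // Job 2 (a separate application, possibly much later, with its own
    // SparkContext/SQLContext; shown with the same variable here for brevity):
    DataFrame shared =
        sqlContext.read().parquet("alluxio://alluxio-master:19998/shared/events.parquet");
    shared.registerTempTable("events");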

Hope this helps,
Calvin

On Tue, Dec 13, 2016 at 3:42 AM, Chetan Khatri 
wrote:

> Hello Guys,
>
> What would be the approach to accomplish a Spark multiple shared context
> without Alluxio and with Alluxio, and what would be the best practice to
> achieve parallelism and concurrency for Spark jobs?
>
> Thanks.
>
> --
> Yours Aye,
> Chetan Khatri.
> M. +91 7 80574
> Data Science Researcher
> INDIA
>
>


Re: sanboxing spark executors

2016-11-04 Thread Calvin Jia
Hi,

If you are using the latest Alluxio release (1.3.0), authorization is
enabled, preventing users from accessing data they do not have permissions
to. For older versions, you will need to enable the security flag. The
documentation on security has more details.

Hope this helps,
Calvin

On Fri, Nov 4, 2016 at 6:31 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:

> I think running it on a Mesos cluster could give you better control over
> this kinda stuff.
>
>
> On Fri, Nov 4, 2016 at 7:41 AM, blazespinnaker 
> wrote:
>
>> Is there a good method / discussion / documentation on how to sandbox a
>> spark executor?  Assume the code is untrusted and you don't want it to be
>> able to make unvalidated network connections or do unvalidated
>> alluxio/hdfs/file io.
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/sanboxing-spark-executors-tp28014.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Otter Networks UG
> http://otternetworks.de
> Gotenstraße 17
> 10829 Berlin
>


Re: feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-19 Thread Calvin Jia
Hi,

Alluxio allows for data sharing between applications through a File System
API (Native Java Alluxio client, Hadoop FileSystem, or POSIX through fuse).
If your MPI applications can use any of these interfaces, you should be
able to use Alluxio for data sharing out of the box.
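
As one hedged example (host, config key, and paths are placeholders), a plain
JVM process on the MPI side could exchange files with Spark through the Hadoop
FileSystem interface backed by Alluxio:

    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration hadoopConf = new Configuration();
    // The alluxio:// scheme must resolve to Alluxio's Hadoop client class;
    // verify the exact class name against the Alluxio docs for your version.
    hadoopConf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
    FileSystem fs =
        FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), hadoopConf);
    try (OutputStream out = fs.create(new Path("/exchange/from-mpi.bin"))) {
        out.write(new byte[]{1, 2, 3}); // data produced outside Spark
    }
    // A Spark job can now read alluxio://alluxio-master:19998/exchange/from-mpi.bin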

In terms of duplicating in-memory data, you should only need one copy in
Alluxio if you are able to stream your dataset. As for the performance of
using Alluxio to back your data compared to using Spark's native in-memory
representation, here is a blog which
details the pros and cons of the two approaches. At a high level, Alluxio
performs better with larger datasets or if you plan to use your dataset in
more than one Spark job.

Hope this helps,
Calvin


Re: how to use spark.mesos.constraints

2016-07-26 Thread Jia Yu
Hi,

I am also trying to use the spark.mesos.constraints but it gives me the
same error: job has not been accepted by any resources.

I am doubting that I should start some additional service like
./sbin/start-mesos-shuffle-service.sh. Am I correct?

Thanks,
Jia
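
For reference, the property documented on the Spark-on-Mesos page is
spark.mesos.constraints (plural), and it matches Mesos slave attributes rather
than resources such as cpus; a hedged sketch with made-up attribute values:

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .setAppName("mesos-constraints-example")
        // Only accept offers from slaves whose attributes match (placeholder values).
        .set("spark.mesos.constraints", "rack:us-east-1;ssd:true")
        // Executor count and size are controlled separately from the constraints.
        .set("spark.cores.max", "8")
        .set("spark.executor.memory", "4g");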

On Tue, Dec 1, 2015 at 5:14 PM, rarediel <bryce.ag...@gettyimages.com>
wrote:

> I am trying to add Mesos constraints to my spark-submit command in my
> Marathon file; I am also setting spark.mesos.coarse=true.
>
> Here is an example of a constraint I am trying to set.
>
>  --conf spark.mesos.constraint=cpus:2
>
> I want to use the constraints to control the number of executors created
> so I can control the total memory of my Spark job.
>
> I've tried many variations of resource constraints, but no matter which
> resource, number, or range I use, I always get the error "Initial
> job has not accepted any resources; check your cluster UI...".  My cluster
> has the available resources.  Are there any examples I can look at where
> people use resource constraints?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-use-spark-mesos-constraints-tp25541.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
Hi Sean,

Thanks for your great help! It works all right if I remove persist!!

For next step, I will transform those values before persist.
I convert to RDD and back to JavaRDD just for testing purposes.

Best Regards,
Jia
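
A hedged sketch of that transformation (Java 8 lambda syntax; assuming
DoubleArrayWritable.getData() returns the reused backing array, hence the
clone): copy each record into plain Java objects before persisting, instead of
caching references to the Writables that Sean describes below.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.storage.StorageLevel;
    import scala.Tuple2;

    JavaPairRDD<IntWritable, DoubleArrayWritable> data =
        sc.sequenceFile(inputFile, IntWritable.class, DoubleArrayWritable.class);

    // Materialize independent copies of each key and value, then cache those.
    JavaPairRDD<Integer, double[]> points = data.mapToPair(tuple ->
        new Tuple2<>(tuple._1().get(), tuple._2().getData().clone()));

    points.persist(StorageLevel.MEMORY_ONLY());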

On Mon, Jul 25, 2016 at 1:01 PM, Sean Owen <so...@cloudera.com> wrote:

> Why are you converting to RDD and back to JavaRDD?
> The problem is storing references to Writable, which are mutated by the
> InputFormat. Somewhere you have 1000 refs to the same key. I think it may
> be the persist. You want to immediately transform these values to something
> besides a Writable.
>
> On Mon, Jul 25, 2016, 18:50 Jia Zou <jacqueline...@gmail.com> wrote:
>
>>
>> My code is as following:
>>
>> System.out.println("Initialize points...");
>>
>> JavaPairRDD<IntWritable, DoubleArrayWritable> data =
>>
>> sc.sequenceFile(inputFile, IntWritable.
>> class, DoubleArrayWritable.class);
>>
>> RDD<Tuple2<IntWritable, DoubleArrayWritable>> rdd =
>>
>> JavaPairRDD.toRDD(data);
>>
>> JavaRDD<Tuple2<IntWritable, DoubleArrayWritable>> points
>> = JavaRDD.fromRDD(rdd, data.classTag());
>>
>> points.persist(StorageLevel.MEMORY_ONLY());
>>
>> int i;
>>
>>
>>   for (i=0; i<iterations; i++) {
>>
>> System.out.println("iteration="+i);
>>
>> //points.foreach(new
>> ForEachMapPointToCluster(numDimensions, numClusters));
>>
>> points.foreach(new
>> VoidFunction<Tuple2<IntWritable, DoubleArrayWritable>>() {
>>
>> public void call(Tuple2<IntWritable,
>> DoubleArrayWritable> tuple) {
>>
>> IntWritable key = tuple._1();
>>
>> System.out.println("key:"+key.get());
>>
>> DoubleArrayWritable array = tuple._2();
>>
>> double[] point = array.getData();
>>
>> for (int d = 0; d < 20; d ++) {
>>
>> System.out.println(d+":"+point[d]);
>>
>> }
>>
>> }
>>
>> });
>>
>> }
>>
>>
>> The output is a lot of following, only the last element in the rdd has
>> been output.
>>
>> key:999
>>
>> 0:0.9953839426689233
>>
>> 1:0.12656798341145892
>>
>> 2:0.16621114723289654
>>
>> 3:0.48628049787614236
>>
>> 4:0.476991470215116
>>
>> 5:0.5033640235789054
>>
>> 6:0.09257098597507829
>>
>> 7:0.3153088440494892
>>
>> 8:0.8807426085223242
>>
>> 9:0.2809625780570739
>>
>> 10:0.9584880094505738
>>
>> 11:0.38521222520661547
>>
>> 12:0.5114241334425228
>>
>> 13:0.9524628903835111
>>
>> 14:0.5252549496842003
>>
>> 15:0.5732037830866236
>>
>> 16:0.8632451606583632
>>
>> 17:0.39754347061499895
>>
>> 18:0.2859522809981715
>>
>> 19:0.2659002343432888
>>
>> key:999
>>
>> 0:0.9953839426689233
>>
>> 1:0.12656798341145892
>>
>> 2:0.16621114723289654
>>
>> 3:0.48628049787614236
>>
>> 4:0.476991470215116
>>
>> 5:0.5033640235789054
>>
>> 6:0.09257098597507829
>>
>> 7:0.3153088440494892
>>
>> 8:0.8807426085223242
>>
>> 9:0.2809625780570739
>>
>> 10:0.9584880094505738
>>
>> 11:0.38521222520661547
>>
>> 12:0.5114241334425228
>>
>> 13:0.9524628903835111
>>
>> 14:0.5252549496842003
>>
>> 15:0.5732037830866236
>>
>> 16:0.8632451606583632
>>
>> 17:0.39754347061499895
>>
>> 18:0.2859522809981715
>>
>> 19:0.2659002343432888
>>
>> key:999
>>
>> 0:0.9953839426689233
>>
>> 1:0.12656798341145892
>>
>> 2:0.16621114723289654
>>
>> 3:0.48628049787614236
>>
>> 4:0.476991470215116
>>
>> 5:0.5033640235789054
>>
>> 6:0.09257098597507829
>>
>> 7:0.3153088440494892
>>
>> 8:0.8807426085223242
>>
>> 9:0.2809625780570739
>>
>> 10:0.9584880094505738
>>
>> 11:0.38521222520661547
>>
>> 12:0.5114241334425228
>>
>> 13:0.9524628903835111
>>
>> 14:0.5252549496842003
>>
>> 15:0.5732037830866236
>>
>> 16:0.8632451606583632
>>
>> 17:0.39754347061499895
>>
>> 18:0.2859522809981715
>>
>> 19:0.2659002343432888
>>
>


JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
My code is as following:

System.out.println("Initialize points...");

JavaPairRDD<IntWritable, DoubleArrayWritable> data =
    sc.sequenceFile(inputFile, IntWritable.class, DoubleArrayWritable.class);

RDD<Tuple2<IntWritable, DoubleArrayWritable>> rdd = JavaPairRDD.toRDD(data);

JavaRDD<Tuple2<IntWritable, DoubleArrayWritable>> points =
    JavaRDD.fromRDD(rdd, data.classTag());

points.persist(StorageLevel.MEMORY_ONLY());

int i;

for (i = 0; i < iterations; i++) {

Spark reduce serialization question

2016-03-04 Thread James Jia
I'm running a distributed KMeans algorithm with 4 executors.

I have an RDD[Data]. I use mapPartitions to run a learner on each data
partition, and then call reduce with my custom model reduce function
to reduce the result of the model to start a new iteration.

The model size is around ~330 MB. I would expect the required memory for
the serialized results at the driver to be at most 2*330 MB in order to
reduce two models at a time, but it looks like Spark is serializing
all of the models to the driver before reducing.

The error message says that the total size of the serialized results
is 1345.5MB, which is approximately 4 * 330 MB. I know I can set the
driver's max result size, but I just want to confirm that this is
expected behavior.

Thanks!

James

[Stage 0:==>                                                      (1 + 3) / 4]
16/02/19 05:59:28 ERROR TaskSetManager: Total size of serialized results of
4 tasks (1345.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 4 tasks (1345.5 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)

  at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)

  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)

  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)

  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)

  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)

  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)

  at scala.Option.foreach(Option.scala:257)

  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)

  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)

  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)

  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)

  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)

  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)

  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)

  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007)

  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)

  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)

  at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)

  at org.apache.spark.rdd.RDD.reduce(RDD.scala:989)

  at BIDMach.RunOnSpark$.runOnSpark(RunOnSpark.scala:50)

  ... 50 elided
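
For reference, a hedged sketch of the two usual workarounds ('models',
'mergeModels', and 'Model' are stand-ins for the RDD, reduce function, and
model type in the post; the maxResultSize value is illustrative):

    import org.apache.spark.SparkConf;

    // Option 1: raise the driver-side cap named in the error message.
    SparkConf conf = new SparkConf().set("spark.driver.maxResultSize", "4g");

    // Option 2: reduce in a tree so partial merges happen on executors and the
    // driver only receives a few already-merged models at the end.
    Model merged = models.treeReduce(mergeModels, 2 /* depth */);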


Re: how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Jia Zou
Divya,

According to my recent Spark tuning experiences, optimal executor-memory
size not only depends on your workload characteristics (e.g. working set
size at each job stage) and input data size, but also depends on your total
available memory and memory requirements of other components like driver
(also depends on how your workload interacts with driver) and underlying
storage. In my opinion, it may be difficult to derive one generic and easy
formula to describe all the dynamic relationships.


Best Regards,
Jia

On Wed, Feb 3, 2016 at 12:13 AM, Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Hi,
>
> I would like to know how to calculate how much --executor-memory we should
> allocate, and how many num-executors and total-executor-cores we should give
> while submitting Spark jobs.
> Is there any formula for it ?
>
>
> Thanks,
> Divya
>


Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-02-01 Thread Jia Zou
Hi, Calvin, I am running a 24GB-data Spark KMeans job in a c3.2xlarge AWS
instance with 30GB physical memory.
Spark will cache data off-heap to Tachyon, and the input data is also stored
in Tachyon.
Tachyon is configured to use 15GB memory and a tiered store.
Tachyon's underFS is /tmp.

The only configuration I've changed is Tachyon data block size.

Above experiment is a part of a research project.

Best Regards,
Jia

On Thursday, January 28, 2016 at 9:11:19 PM UTC-6, Calvin Jia wrote:
>
> Hi,
>
> Thanks for the detailed information. How large is the dataset you are 
> running against? Also did you change any Tachyon configurations?
>
> Thanks,
> Calvin
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-28 Thread Calvin Jia
Hi,

Thanks for the detailed information. How large is the dataset you are 
running against? Also did you change any Tachyon configurations?

Thanks,
Calvin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

[Problem Solved] Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi, dears, the problem has been solved.
I mistakenly used tachyon.user.block.size.bytes instead of
tachyon.user.block.size.bytes.default. It works now. Sorry for the
confusion and thanks again to Gene!
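
For anyone hitting the same issue, a hedged sketch of passing the property per
job as Gene suggested (134217728 bytes = 128MB; whether the driver, the
executors, or both need it may depend on your setup):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.driver.extraJavaOptions",
             "-Dtachyon.user.block.size.bytes.default=134217728")
        .set("spark.executor.extraJavaOptions",
             "-Dtachyon.user.block.size.bytes.default=134217728");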

Best Regards,
Jia

On Wed, Jan 27, 2016 at 4:59 AM, Jia Zou <jacqueline...@gmail.com> wrote:

> Hi, Gene,
>
> Thanks for your suggestion.
> However, even if I set tachyon.user.block.size.bytes=134217728 (and I can
> see that from the web console), the files that I load to Tachyon via
> copyToLocal still have a 512MB block size.
> Do you have more suggestions?
>
> Best Regards,
> Jia
>
> On Tue, Jan 26, 2016 at 11:46 PM, Gene Pang <gene.p...@gmail.com> wrote:
>
>> Hi Jia,
>>
>> If you want to change the Tachyon block size, you can set the
>> tachyon.user.block.size.bytes.default parameter (
>> http://tachyon-project.org/documentation/Configuration-Settings.html).
>> You can set it via extraJavaOptions per job, or adding it to
>> tachyon-site.properties.
>>
>> I hope that helps,
>> Gene
>>
>> On Mon, Jan 25, 2016 at 8:13 PM, Jia Zou <jacqueline...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> First to update that the local file system data partition size can be
>>> tuned by:
>>> sc.hadoopConfiguration().setLong("fs.local.block.size", blocksize)
>>>
>>> However, I also need to tune Spark data partition size for input data
>>> that is stored in Tachyon (default is 512MB), but above method can't work
>>> for Tachyon data.
>>>
>>> Do you have any suggestions? Thanks very much!
>>>
>>> Best Regards,
>>> Jia
>>>
>>>
>>> -- Forwarded message --
>>> From: Jia Zou <jacqueline...@gmail.com>
>>> Date: Thu, Jan 21, 2016 at 10:05 PM
>>> Subject: Spark partition size tuning
>>> To: "user @spark" <user@spark.apache.org>
>>>
>>>
>>> Dear all!
>>>
>>> When using Spark to read from local file system, the default partition
>>> size is 32MB, how can I increase the partition size to 128MB, to reduce the
>>> number of tasks?
>>>
>>> Thank you very much!
>>>
>>> Best Regards,
>>> Jia
>>>
>>>
>>
>


TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
Dears, I keep getting below exception when using Spark 1.6.0 on top of
Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH.

Any suggestions will be appreciated, thanks!

=

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 13 in stage 0.0 failed 4 times, most recent failure:
Lost task 13.3 in stage 0.0 (TID 33, ip-10-73-198-35.ec2.internal):
java.io.IOException: tachyon.org.apache.thrift.transport.TTransportException

at tachyon.worker.WorkerClient.unlockBlock(WorkerClient.java:416)

at tachyon.client.block.LocalBlockInStream.close(LocalBlockInStream.java:87)

at tachyon.client.file.FileInStream.close(FileInStream.java:105)

at tachyon.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:171)

at java.io.DataInputStream.readInt(DataInputStream.java:388)

at
org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2325)

at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2356)

at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2493)

at
org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)

at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)

at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)

at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)

at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)

at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:193)

at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$31$$anon$1.hasNext(RDD.scala:851)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)

at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1595)

at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)

at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: tachyon.org.apache.thrift.transport.TTransportException

at
tachyon.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)

at
tachyon.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at
tachyon.org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)

at
tachyon.org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)

at
tachyon.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at
tachyon.org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)

at
tachyon.org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)

at
tachyon.org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)

at
tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)

at
tachyon.thrift.WorkerService$Client.recv_unlockBlock(WorkerService.java:455)

at tachyon.thrift.WorkerService$Client.unlockBlock(WorkerService.java:441)

at tachyon.worker.WorkerClient.unlockBlock(WorkerClient.java:413)

... 28 more


Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
BTW, the Tachyon worker log says the following:



2015-12-27 01:33:44,599 ERROR WORKER_LOGGER
(WorkerBlockMasterClient.java:getId) - java.net.SocketException: Connection
reset

org.apache.thrift.transport.TTransportException: java.net.SocketException:
Connection reset

at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)

at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)

at
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)

at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)

at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)

at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)

at
org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)

at
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)

at
tachyon.thrift.BlockMasterService$Client.recv_workerGetWorkerId(BlockMasterService.java:235)

at
tachyon.thrift.BlockMasterService$Client.workerGetWorkerId(BlockMasterService.java:222)

at
tachyon.client.WorkerBlockMasterClient.getId(WorkerBlockMasterClient.java:103)

at
tachyon.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:59)

at tachyon.worker.block.BlockWorker.<init>(BlockWorker.java:200)

at tachyon.worker.TachyonWorker.main(TachyonWorker.java:42)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:196)

at java.net.SocketInputStream.read(SocketInputStream.java:122)

at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)

at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)

at java.io.BufferedInputStream.read(BufferedInputStream.java:334)

at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)

... 15 more

On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote:

> Dears, I keep getting below exception when using Spark 1.6.0 on top of
> Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH.
>
> Any suggestions will be appreciated, thanks!
>
> =
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 13 in stage 0.0 failed 4 times, most recent
> failure: Lost task 13.3 in stage 0.0 (TID 33,
> ip-10-73-198-35.ec2.internal): java.io.IOException:
> tachyon.org.apache.thrift.transport.TTransportException
>
> at tachyon.worker.WorkerClient.unlockBlock(WorkerClient.java:416)
>
> at
> tachyon.client.block.LocalBlockInStream.close(LocalBlockInStream.java:87)
>
> at tachyon.client.file.FileInStream.close(FileInStream.java:105)
>
> at tachyon.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:171)
>
> at java.io.DataInputStream.readInt(DataInputStream.java:388)
>
> at
> org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2325)
>
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2356)
>
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2493)
>
> at
> org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
>
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
>
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
>
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:193)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$31$$anon$1.hasNext(RDD.scala:851)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1595)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
BTW, at the end of the log I also see a lot of errors like the one below:

=

2016-01-27 11:47:18,515 ERROR server.TThreadPoolServer
(TThreadPoolServer.java:run) - Error occurred during processing of message.

java.lang.NullPointerException

at
tachyon.worker.block.BlockLockManager.unlockBlock(BlockLockManager.java:142)

at
tachyon.worker.block.TieredBlockStore.unlockBlock(TieredBlockStore.java:148)

at
tachyon.worker.block.BlockDataManager.unlockBlock(BlockDataManager.java:476)

at
tachyon.worker.block.BlockServiceHandler.unlockBlock(BlockServiceHandler.java:232)

at
tachyon.thrift.WorkerService$Processor$unlockBlock.getResult(WorkerService.java:1150)

at
tachyon.thrift.WorkerService$Processor$unlockBlock.getResult(WorkerService.java:1135)

at
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)

at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)

at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)


On Wed, Jan 27, 2016 at 5:53 AM, Jia Zou <jacqueline...@gmail.com> wrote:

> BTW. The tachyon worker log says following:
>
> 
>
> 2015-12-27 01:33:44,599 ERROR WORKER_LOGGER
> (WorkerBlockMasterClient.java:getId) - java.net.SocketException: Connection
> reset
>
> org.apache.thrift.transport.TTransportException: java.net.SocketException:
> Connection reset
>
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>
> at
> org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
> at
> org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
>
> at
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
>
> at
> org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>
> at
> org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
>
> at
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>
> at
> tachyon.thrift.BlockMasterService$Client.recv_workerGetWorkerId(BlockMasterService.java:235)
>
> at
> tachyon.thrift.BlockMasterService$Client.workerGetWorkerId(BlockMasterService.java:222)
>
> at
> tachyon.client.WorkerBlockMasterClient.getId(WorkerBlockMasterClient.java:103)
>
> at
> tachyon.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:59)
>
> at tachyon.worker.block.BlockWorker.<init>(BlockWorker.java:200)
>
> at tachyon.worker.TachyonWorker.main(TachyonWorker.java:42)
>
> Caused by: java.net.SocketException: Connection reset
>
> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
>
> ... 15 more
>
> On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote:
>
>> Dears, I keep getting below exception when using Spark 1.6.0 on top of
>> Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH.
>>
>> Any suggestions will be appreciated, thanks!
>>
>> =
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 13 in stage 0.0 failed 4 times, most recent
>> failure: Lost task 13.3 in stage 0.0 (TID 33,
>> ip-10-73-198-35.ec2.internal): java.io.IOException:
>> tachyon.org.apache.thrift.transport.TTransportException
>>
>> at tachyon.worker.WorkerClient.unlockBlock(WorkerClient.java:416)
>>
>> at
>> tachyon.client.block.LocalBlockInStream.close(LocalBlockInStream.java:87)
>>
>> at tachyon.client.

Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi, Gene,

Thanks for your suggestion.
However, even if I set tachyon.user.block.size.bytes=134217728 (and I can
see that from the web console), the files that I load to Tachyon via
copyToLocal still have a 512MB block size.
Do you have more suggestions?

Best Regards,
Jia

On Tue, Jan 26, 2016 at 11:46 PM, Gene Pang <gene.p...@gmail.com> wrote:

> Hi Jia,
>
> If you want to change the Tachyon block size, you can set the
> tachyon.user.block.size.bytes.default parameter (
> http://tachyon-project.org/documentation/Configuration-Settings.html).
> You can set it via extraJavaOptions per job, or adding it to
> tachyon-site.properties.
>
> I hope that helps,
> Gene
>
> On Mon, Jan 25, 2016 at 8:13 PM, Jia Zou <jacqueline...@gmail.com> wrote:
>
>> Dear all,
>>
>> First to update that the local file system data partition size can be
>> tuned by:
>> sc.hadoopConfiguration().setLong("fs.local.block.size", blocksize)
>>
>> However, I also need to tune Spark data partition size for input data
>> that is stored in Tachyon (default is 512MB), but above method can't work
>> for Tachyon data.
>>
>> Do you have any suggestions? Thanks very much!
>>
>> Best Regards,
>> Jia
>>
>>
>> -- Forwarded message --
>> From: Jia Zou <jacqueline...@gmail.com>
>> Date: Thu, Jan 21, 2016 at 10:05 PM
>> Subject: Spark partition size tuning
>> To: "user @spark" <user@spark.apache.org>
>>
>>
>> Dear all!
>>
>> When using Spark to read from local file system, the default partition
>> size is 32MB, how can I increase the partition size to 128MB, to reduce the
>> number of tasks?
>>
>> Thank you very much!
>>
>> Best Regards,
>> Jia
>>
>>
>


Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
BTW, the error happens when configuring Spark to read the input file from
Tachyon as follows:

/home/ubuntu/spark-1.6.0/bin/spark-submit  --properties-file
/home/ubuntu/HiBench/report/kmeans/spark/java/conf/sparkbench/spark.conf
--class org.apache.spark.examples.mllib.JavaKMeans --master spark://ip
-10-73-198-35:7077
/home/ubuntu/HiBench/src/sparkbench/target/sparkbench-5.0-SNAPSHOT-MR2-spark1.5-jar-with-dependencies.jar
tachyon://localhost:19998/Kmeans/Input/samples 10 5

On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote:

> Dears, I keep getting below exception when using Spark 1.6.0 on top of
> Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH.
>
> Any suggestions will be appreciated, thanks!
>
> =
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 13 in stage 0.0 failed 4 times, most recent
> failure: Lost task 13.3 in stage 0.0 (TID 33,
> ip-10-73-198-35.ec2.internal): java.io.IOException:
> tachyon.org.apache.thrift.transport.TTransportException
>
> at tachyon.worker.WorkerClient.unlockBlock(WorkerClient.java:416)
>
> at
> tachyon.client.block.LocalBlockInStream.close(LocalBlockInStream.java:87)
>
> at tachyon.client.file.FileInStream.close(FileInStream.java:105)
>
> at tachyon.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:171)
>
> at java.io.DataInputStream.readInt(DataInputStream.java:388)
>
> at
> org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2325)
>
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2356)
>
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2493)
>
> at
> org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
>
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
>
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
>
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:193)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$31$$anon$1.hasNext(RDD.scala:851)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
>
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1595)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1143)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: tachyon.org.apache.thrift.transport.TTransportException
>
> at
> tachyon.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>
> at
> tachyon.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
> at
> tachyon.org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
>
> at
> tachyon.org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
>
> at
> tachyon.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>
> at
> tachyon.org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>
> at
> tachyon.org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>
> at
> tachyon.org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>
> at
> tachyon.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>
> at
> tachyon.thrift.WorkerService$Client.recv_unlockBlock(WorkerService.java:455)
>
> at tachyon.thrift.WorkerService$Client.unlockBlock(WorkerService.java:441)
>
> at tachyon.worker.WorkerClient.unlockBlock(WorkerClient.java:413)
>
> ... 28 more
>
>
>


Fwd: Spark partition size tuning

2016-01-25 Thread Jia Zou
Dear all,

First, an update: the local file system data partition size can be tuned
by:
sc.hadoopConfiguration().setLong("fs.local.block.size", blocksize)
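
In full, the local-filesystem case looks roughly like this (paths and block
size are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("partition-size-tuning"));
    // Ask for 128MB splits for file:// input instead of the 32MB default.
    sc.hadoopConfiguration().setLong("fs.local.block.size", 128L * 1024 * 1024);
    JavaRDD<String> lines = sc.textFile("file:///data/input.txt");
    System.out.println("partitions: " + lines.partitions().size());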

However, I also need to tune the Spark data partition size for input data
that is stored in Tachyon (the default is 512MB), but the above method
doesn't work for Tachyon data.

Do you have any suggestions? Thanks very much!

Best Regards,
Jia


-- Forwarded message ------
From: Jia Zou <jacqueline...@gmail.com>
Date: Thu, Jan 21, 2016 at 10:05 PM
Subject: Spark partition size tuning
To: "user @spark" <user@spark.apache.org>


Dear all!

When using Spark to read from local file system, the default partition size
is 32MB, how can I increase the partition size to 128MB, to reduce the
number of tasks?

Thank you very much!

Best Regards,
Jia


Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
I configured HDFS to cache files in HDFS's centralized cache, like the following:

hdfs cacheadmin -addPool hibench

hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench


But I didn't see much performance impact, no matter how I configured
dfs.datanode.max.locked.memory.


Is it possible that Spark doesn't know the data is in the HDFS cache, and
still reads the data from disk instead of from the HDFS cache?


Thanks!

Jia


Spark partition size tuning

2016-01-21 Thread Jia Zou
Dear all!

When using Spark to read from local file system, the default partition size
is 32MB, how can I increase the partition size to 128MB, to reduce the
number of tasks?

Thank you very much!

Best Regards,
Jia


Re: spark 1.6.0 on ec2 doesn't work

2016-01-19 Thread Calvin Jia
Hi Oleg,

The Tachyon related issue should be fixed.

Hope this helps,
Calvin

On Mon, Jan 18, 2016 at 2:51 AM, Oleg Ruchovets 
wrote:

> Hi ,
>I try to follow the spartk 1.6.0 to install spark on EC2.
>
> It doesn't work properly - I got exceptions, and in the end a standalone
> Spark cluster was installed.
> here is log information:
>
> Any suggestions?
>
> Thanks
> Oleg.
>
> oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2
> --key-pair=CC-ES-Demo
> --identity-file=/home/oleg/work/entity_extraction_framework/ec2_pem_key/CC-ES-Demo.pem
> --region=us-east-1 --zone=us-east-1a --spot-price=0.05 -s 5
> --spark-version=1.6.0 launch entity-extraction-spark-cluster
> Setting up security groups...
> Searching for existing cluster entity-extraction-spark-cluster in region
> us-east-1...
> Spark AMI: ami-5bb18832
> Launching instances...
> Requesting 5 slaves as spot instances with price $0.050
> Waiting for spot instances to be granted...
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> All 5 slaves granted
> Launched master in us-east-1a, regid = r-9384033f
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state..
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
> Cluster is now in 'ssh-ready' state. Waited 442 seconds.
> Generating cluster's SSH key on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Transferring cluster's SSH key to slaves...
> ec2-54-165-243-74.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-243-74.compute-1.amazonaws.com,54.165.243.74'
> (ECDSA) to the list of known hosts.
> ec2-54-88-245-107.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-88-245-107.compute-1.amazonaws.com,54.88.245.107'
> (ECDSA) to the list of known hosts.
> ec2-54-172-29-47.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-29-47.compute-1.amazonaws.com,54.172.29.47'
> (ECDSA) to the list of known hosts.
> ec2-54-165-131-210.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-131-210.compute-1.amazonaws.com,54.165.131.210'
> (ECDSA) to the list of known hosts.
> ec2-54-172-46-184.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-46-184.compute-1.amazonaws.com,54.172.46.184'
> (ECDSA) to the list of known hosts.
> Cloning spark-ec2 scripts from
> https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Cloning into 'spark-ec2'...
> remote: Counting objects: 2068, done.
> remote: Total 2068 (delta 0), reused 0 (delta 0), pack-reused 2068
> Receiving objects: 100% (2068/2068), 349.76 KiB, done.
> Resolving deltas: 100% (796/796), done.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Deploying files to master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> sending incremental file list
> root/spark-ec2/ec2-variables.sh
>
> sent 1,835 bytes  received 40 bytes  416.67 bytes/sec
> total size is 1,684  speedup is 0.90
> Running setup on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Setting up Spark on ip-172-31-24-124.ec2.internal...
> Setting executable permissions on scripts...
> RSYNC'ing /root/spark-ec2 to other cluster nodes...
> 

Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread Jia
Hi, Praveen, have you checked this out? It might have the details you need:
https://spark-summit.org/2014/wp-content/uploads/2014/07/Spark-Job-Server-Easy-Spark-Job-Management-Chan-Chu.pdf

Best Regards,
Jia


On Jan 19, 2016, at 7:28 AM, praveen S <mylogi...@gmail.com> wrote:

> Can you give me more details on Spark's jobserver.
> 
> Regards, 
> Praveen
> 
> On 18 Jan 2016 03:30, "Jia" <jacqueline...@gmail.com> wrote:
> I guess all jobs submitted through JobServer are executed in the same JVM, so 
> RDDs cached by one job can be visible to all other jobs executed later.
> On Jan 17, 2016, at 3:56 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> 
>> Yes, that is one of the basic reasons to use a 
>> jobserver/shared-SparkContext.  Otherwise, in order share the data in an RDD 
>> you have to use an external storage system, such as a distributed filesystem 
>> or Tachyon.
>> 
>> On Sun, Jan 17, 2016 at 1:52 PM, Jia <jacqueline...@gmail.com> wrote:
>> Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem, so 
>> that jobs can be submitted at different time and still share RDDs.
>> 
>> Best Regards,
>> Jia
>> 
>> 
>> On Jan 17, 2016, at 3:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> 
>>> There is a 1-to-1 relationship between Spark Applications and SparkContexts 
>>> -- fundamentally, a Spark Application is a program that creates and uses a 
>>> SparkContext, and that SparkContext is destroyed when the Application 
>>> ends.  A jobserver generically and the Spark JobServer specifically is an 
>>> Application that keeps a SparkContext open for a long time and allows many 
>>> Jobs to be submitted and run using that shared SparkContext.
>>> 
>>> More than one Application/SparkContext unavoidably implies more than one 
>>> JVM process per Worker -- Applications/SparkContexts cannot share JVM 
>>> processes.  
>>> 
>>> On Sun, Jan 17, 2016 at 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
>>> Hi, Mark, sorry for the confusion.
>>> 
>>> Let me clarify, when an application is submitted, the master will tell each 
>>> Spark worker to spawn an executor JVM process. All the task sets  of the 
>>> application will be executed by the executor. After the application runs to 
>>> completion. The executor process will be killed.
>>> But I hope that all applications submitted can run in the same executor, 
>>> can JobServer do that? If so, it’s really good news!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> On Jan 17, 2016, at 3:09 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>> 
>>>> You've still got me confused.  The SparkContext exists at the Driver, not 
>>>> on an Executor.
>>>> 
>>>> Many Jobs can be run by a SparkContext -- it is a common pattern to use 
>>>> something like the Spark Jobserver where all Jobs are run through a shared 
>>>> SparkContext.
>>>> 
>>>> On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou <jacqueline...@gmail.com> wrote:
>>>> Hi, Mark, sorry, I mean SparkContext.
>>>> I mean to change Spark into running all submitted jobs (SparkContexts) in 
>>>> one executor JVM.
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 
>>>> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra <m...@clearstorydata.com> 
>>>> wrote:
>>>> -dev
>>>> 
>>>> What do you mean by JobContext?  That is a Hadoop mapreduce concept, not 
>>>> Spark.
>>>> 
>>>> On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou <jacqueline...@gmail.com> wrote:
>>>> Dear all,
>>>> 
>>>> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 



Can I configure Spark on multiple nodes using local filesystem on each node?

2016-01-19 Thread Jia Zou
Dear all,

Can I configure Spark on multiple nodes without HDFS, so that output data
will be written to the local file system on each node?

I guess there is no such feature in Spark, but just want to confirm.

Best Regards,
Jia


Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
Hi, Mark, sorry for the confusion.

Let me clarify: when an application is submitted, the master will tell each 
Spark worker to spawn an executor JVM process. All the task sets of the 
application will be executed by that executor. After the application runs to 
completion, the executor process will be killed.
But I hope that all applications submitted can run in the same executor, can 
JobServer do that? If so, it’s really good news!

Best Regards,
Jia

On Jan 17, 2016, at 3:09 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> You've still got me confused.  The SparkContext exists at the Driver, not on 
> an Executor.
> 
> Many Jobs can be run by a SparkContext -- it is a common pattern to use 
> something like the Spark Jobserver where all Jobs are run through a shared 
> SparkContext.
> 
> On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou <jacqueline...@gmail.com> wrote:
> Hi, Mark, sorry, I mean SparkContext.
> I mean to change Spark into running all submitted jobs (SparkContexts) in one 
> executor JVM.
> 
> Best Regards,
> Jia
> 
> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> -dev
> 
> What do you mean by JobContext?  That is a Hadoop mapreduce concept, not 
> Spark.
> 
> On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou <jacqueline...@gmail.com> wrote:
> Dear all,
> 
> Is there a way to reuse executor JVM across different JobContexts? Thanks.
> 
> Best Regards,
> Jia
> 
> 
> 



Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem, so 
that jobs can be submitted at different times and still share RDDs.
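
As a minimal sketch of the pattern (not the Spark JobServer API itself; paths
and names are made up), a single long-lived application holds one SparkContext,
caches an RDD once, and then runs many later jobs against it:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SharedContextServer {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("shared-context"));
            // Cached once; visible to every job run through this same SparkContext.
            JavaRDD<String> shared = sc.textFile("hdfs:///data/events").cache();
            shared.count(); // materialize the cache

            // Jobs submitted later (e.g. via a REST endpoint in a real jobserver)
            // reuse the cached RDD instead of reloading it.
            long errors = shared.filter(line -> line.contains("ERROR")).count();
            long warnings = shared.filter(line -> line.contains("WARN")).count();
            System.out.println(errors + " errors, " + warnings + " warnings");
        }
    }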

Best Regards,
Jia


On Jan 17, 2016, at 3:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> There is a 1-to-1 relationship between Spark Applications and SparkContexts 
> -- fundamentally, a Spark Application is a program that creates and uses a 
> SparkContext, and that SparkContext is destroyed when the Application ends.  
> A jobserver generically and the Spark JobServer specifically is an 
> Application that keeps a SparkContext open for a long time and allows many 
> Jobs to be submitted and run using that shared SparkContext.
> 
> More than one Application/SparkContext unavoidably implies more than one JVM 
> process per Worker -- Applications/SparkContexts cannot share JVM processes.  
> 
> On Sun, Jan 17, 2016 at 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
> Hi, Mark, sorry for the confusion.
> 
> Let me clarify, when an application is submitted, the master will tell each 
> Spark worker to spawn an executor JVM process. All the task sets  of the 
> application will be executed by the executor. After the application runs to 
> completion. The executor process will be killed.
> But I hope that all applications submitted can run in the same executor, can 
> JobServer do that? If so, it’s really good news!
> 
> Best Regards,
> Jia
> 
> On Jan 17, 2016, at 3:09 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> 
>> You've still got me confused.  The SparkContext exists at the Driver, not on 
>> an Executor.
>> 
>> Many Jobs can be run by a SparkContext -- it is a common pattern to use 
>> something like the Spark Jobserver where all Jobs are run through a shared 
>> SparkContext.
>> 
>> On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou <jacqueline...@gmail.com> wrote:
>> Hi, Mark, sorry, I mean SparkContext.
>> I mean to change Spark into running all submitted jobs (SparkContexts) in 
>> one executor JVM.
>> 
>> Best Regards,
>> Jia
>> 
>> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra <m...@clearstorydata.com> 
>> wrote:
>> -dev
>> 
>> What do you mean by JobContext?  That is a Hadoop mapreduce concept, not 
>> Spark.
>> 
>> On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou <jacqueline...@gmail.com> wrote:
>> Dear all,
>> 
>> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>> 
>> Best Regards,
>> Jia
>> 
>> 
>> 
> 
> 



Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Hi, Mark, sorry, I mean SparkContext.
I mean to change Spark into running all submitted jobs (SparkContexts) in
one executor JVM.

Best Regards,
Jia

On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> -dev
>
> What do you mean by JobContext?  That is a Hadoop mapreduce concept, not
> Spark.
>
> On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou <jacqueline...@gmail.com> wrote:
>
>> Dear all,
>>
>> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>>
>> Best Regards,
>> Jia
>>
>
>


Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
I guess all jobs submitted through JobServer are executed in the same JVM, so 
RDDs cached by one job can be visible to all other jobs executed later.
On Jan 17, 2016, at 3:56 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Yes, that is one of the basic reasons to use a jobserver/shared-SparkContext. 
>  Otherwise, in order share the data in an RDD you have to use an external 
> storage system, such as a distributed filesystem or Tachyon.
> 
> On Sun, Jan 17, 2016 at 1:52 PM, Jia <jacqueline...@gmail.com> wrote:
> Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem, so 
> that jobs can be submitted at different time and still share RDDs.
> 
> Best Regards,
> Jia
> 
> 
> On Jan 17, 2016, at 3:44 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> 
>> There is a 1-to-1 relationship between Spark Applications and SparkContexts 
>> -- fundamentally, a Spark Application is a program that creates and uses a 
>> SparkContext, and that SparkContext is destroyed when the Application ends. 
>> A jobserver generically, and the Spark JobServer specifically, is an 
>> Application that keeps a SparkContext open for a long time and allows many 
>> Jobs to be submitted and run using that shared SparkContext.
>> 
>> More than one Application/SparkContext unavoidably implies more than one JVM 
>> process per Worker -- Applications/SparkContexts cannot share JVM processes. 
>>  
>> 
>> On Sun, Jan 17, 2016 at 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
>> Hi, Mark, sorry for the confusion.
>> 
>> Let me clarify: when an application is submitted, the master will tell each 
>> Spark worker to spawn an executor JVM process. All the task sets of the 
>> application will be executed by that executor. After the application runs to 
>> completion, the executor process will be killed.
>> But I hope that all applications submitted can run in the same executor, can 
>> JobServer do that? If so, it’s really good news!
>> 
>> Best Regards,
>> Jia
>> 
>> On Jan 17, 2016, at 3:09 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> 
>>> You've still got me confused.  The SparkContext exists at the Driver, not 
>>> on an Executor.
>>> 
>>> Many Jobs can be run by a SparkContext -- it is a common pattern to use 
>>> something like the Spark Jobserver where all Jobs are run through a shared 
>>> SparkContext.
>>> 
>>> On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou <jacqueline...@gmail.com> wrote:
>>> Hi, Mark, sorry, I mean SparkContext.
>>> I mean to change Spark into running all submitted jobs (SparkContexts) in 
>>> one executor JVM.
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra <m...@clearstorydata.com> 
>>> wrote:
>>> -dev
>>> 
>>> What do you mean by JobContext?  That is a Hadoop mapreduce concept, not 
>>> Spark.
>>> 
>>> On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou <jacqueline...@gmail.com> wrote:
>>> Dear all,
>>> 
>>> Is there a way to reuse executor JVM across different JobContexts? Thanks.
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> 
>>> 
>> 
>> 
> 
> 



Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Dear all,

Is there a way to reuse executor JVM across different JobContexts? Thanks.

Best Regards,
Jia


org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Jia Zou
Dear all,

I am using Spark1.5.2 and Tachyon0.7.1 to run KMeans with
inputRDD.persist(StorageLevel.OFF_HEAP()).

I've set tiered storage for Tachyon. It is all right when the working set is
smaller than available memory. However, when the working set exceeds available
memory, I keep getting errors like the ones below:

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.1 in stage
0.0 (TID 206) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 191.1 in stage
0.0 (TID 207) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_191 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.2 in stage
0.0 (TID 208) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 191.2 in stage
0.0 (TID 209) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_191 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.3 in stage
0.0 (TID 210) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found


Can any one give me some suggestions? Thanks a lot!


Best Regards,
Jia


Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-31 Thread Jia Zou
Thanks, Yanbo.
The results became much more reasonable after I set driver memory to 5GB
and increased worker memory to 25GB.

So, my question is: for the following code snippet, extracted from the main
method of JavaKMeans.java in the examples, what will the driver do, and what
will the worker do?

I didn't understand this well after reading
https://spark.apache.org/docs/1.1.0/cluster-overview.html and
http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark

SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans");

JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaRDD<String> lines = sc.textFile(inputFile);

JavaRDD<Vector> points = lines.map(new ParsePoint());

points.persist(StorageLevel.MEMORY_AND_DISK());

KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs,
KMeans.K_MEANS_PARALLEL());
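
My current understanding, as an informal reading rather than anything from the
docs: textFile/map/persist only build the lineage on the driver; the actual
reading, parsing and caching happen inside executor tasks; and KMeans.train runs
on the driver, launching jobs whose tasks compute per-partition statistics while
the driver aggregates them and updates the cluster centers each iteration. A
Scala sketch of the same flow with those roles annotated (parsePoint and the
parameter values are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val inputFile = "/InputFiles"                 // as in the snippet above
val (k, iterations, runs) = (10, 20, 1)       // illustrative values

def parsePoint(line: String) = Vectors.dense(line.split(' ').map(_.toDouble))

// Driver: builds the DAG only, nothing is read yet
val lines  = sc.textFile(inputFile)           // lazy
val points = lines.map(parsePoint)            // lazy; runs later inside executor tasks
points.persist(StorageLevel.MEMORY_AND_DISK)  // cached blocks live in executor memory/disk

// Driver: each iteration launches a job; executor tasks compute partial sums per
// cluster, and the driver aggregates them and updates the centers for the next iteration
val model = KMeans.train(points, k, iterations, runs, KMeans.K_MEANS_PARALLEL)
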


Thank you very much!

Best Regards,
Jia

On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang <yblia...@gmail.com> wrote:

> Hi Jia,
>
> You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it
> can produce stable performance. The storage level of MEMORY_AND_DISK will
> store the partitions that don't fit in memory on disk and read them from
> there when they are needed.
> Actually, it's not necessary to set so large driver memory in your case,
> because KMeans use low memory for driver if your k is not very large.
>
> Cheers
> Yanbo
>
> 2015-12-30 22:20 GMT+08:00 Jia Zou <jacqueline...@gmail.com>:
>
>> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU
>> cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
>> set to 15GB.
>>
>> The observation is that, when input data size is smaller than 15GB, the
>> performance is quite stable. However, when input data becomes larger than
>> that, the performance will be extremely unpredictable. For example, for
>> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three
>> dramatically different testing results: 27mins, 61mins and 114 mins. (All
>> settings are the same for the 3 tests, and I will create input data
>> immediately before running each of the tests to keep OS buffer cache hot.)
>>
>> Anyone can help to explain this? Thanks very much!
>>
>>
>


Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Jia Zou
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU
cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
set to 15GB.

The observation is that, when input data size is smaller than 15GB, the
performance is quite stable. However, when input data becomes larger than
that, the performance will be extremely unpredictable. For example, for
15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three
dramatically different testing results: 27mins, 61mins and 114 mins. (All
settings are the same for the 3 tests, and I will create input data
immediately before running each of the tests to keep OS buffer cache hot.)

Anyone can help to explain this? Thanks very much!


How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
My goal is to use hprof to profile where the bottleneck is.
Is there any way to do this without modifying and rebuilding the Spark source
code?

I've tried to add "
-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/out.hprof"
to spark-class script, but it can only profile the CPU usage of the
org.apache.spark.deploy.SparkSubmit
class, and can not provide insights for other classes like BlockManager,
and user classes.

Any suggestions? Thanks a lot!

Best Regards,
Jia


Re: How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
Hi, Ted, it works, thanks a lot for your help!

--Jia
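
For anyone finding this thread later, a sketch of what that ends up looking
like (the agent arguments and output path are just examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("profiled-app")
  .set("spark.executor.extraJavaOptions",
    "-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/tmp/executor.hprof")

val sc = new SparkContext(conf)   // executors launched for this app get the hprof agent
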

On Sat, Dec 12, 2015 at 3:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Have you tried adding the option below through
> spark.executor.extraJavaOptions ?
>
> Cheers
>
> > On Dec 13, 2015, at 3:36 AM, Jia Zou <jacqueline...@gmail.com> wrote:
> >
> > My goal is to use hprof to profile where the bottleneck is.
> > Is there anyway to do this without modifying and rebuilding Spark source
> code.
> >
> > I've tried to add
> "-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/out.hprof"
> to spark-class script, but it can only profile the CPU usage of the
> org.apache.spark.deploy.SparkSubmit class, and can not provide insights for
> other classes like BlockManager, and user classes.
> >
> > Any suggestions? Thanks a lot!
> >
> > Best Regards,
> > Jia
>


Re: Saving RDDs in Tachyon

2015-12-09 Thread Calvin Jia
Hi Mark,

Were you able to successfully store the RDD with Akhil's method? When you
read it back as an objectFile, you will also need to specify the correct
type.

You can find more information about integrating Spark and Tachyon on this
page: http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
.
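
A minimal sketch of the round trip (the tachyon:// URI and the record type are
illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, value: String)

val sc = new SparkContext(new SparkConf().setAppName("tachyon-objectfile"))
val records = sc.parallelize(Seq(MyRecord(1L, "a"), MyRecord(2L, "b")))

records.saveAsObjectFile("tachyon://master:19998/rdds/my-records")

// Reading it back needs the element type; without it you get an RDD[Nothing]
val restored = sc.objectFile[MyRecord]("tachyon://master:19998/rdds/my-records")
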

Hope this helps,
Calvin

On Fri, Oct 30, 2015 at 7:04 AM, Akhil Das 
wrote:

> I guess you can do a .saveAsObjectFiles and read it back as sc.objectFile
>
> Thanks
> Best Regards
>
> On Fri, Oct 23, 2015 at 7:57 AM, mark  wrote:
>
>> I have Avro records stored in Parquet files in HDFS. I want to read these
>> out as an RDD and save that RDD in Tachyon for any spark job that wants the
>> data.
>>
>> How do I save the RDD in Tachyon? What format do I use? Which RDD
>> 'saveAs...' method do I want?
>>
>> Thanks
>>
>
>


Re: Re: Spark RDD cache persistence

2015-12-09 Thread Calvin Jia
Hi Deepak,

For persistence across Spark jobs, you can store and access the RDDs in
Tachyon. Tachyon works with ramdisk, which would give you in-memory
performance similar to what you would have within a Spark job.

For more information, you can take a look at the docs on Tachyon-Spark
integration:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
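
For example, a minimal sketch of handing an RDD from one application to a later
one through Tachyon (the master address and path are illustrative; someRdd and sc
stand for each application's own objects):

// Application A: materialize the RDD into Tachyon
someRdd.map(_.toString).saveAsTextFile("tachyon://master:19998/shared/daily-aggregates")

// Application B, started later with its own SparkContext: reads it back, at memory
// speed as long as the blocks are still resident in Tachyon's ramdisk
val shared = sc.textFile("tachyon://master:19998/shared/daily-aggregates")
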

Hope this helps,
Calvin

On Thu, Nov 5, 2015 at 10:29 PM, Deenar Toraskar 
wrote:

> You can have a long running Spark context in several fashions. This will
> ensure your data will be cached in memory. Clients will access the RDD
> through a REST API that you can expose. See the Spark Job Server, it does
> something similar. It has something called Named RDDs
>
> Using Named RDDs
>
> Named RDDs are a way to easily share RDDs among jobs. Using this facility,
> computed RDDs can be cached with a given name and later on retrieved. To
> use this feature, the SparkJob needs to mix in NamedRddSupport:
>
> Alternatively if you use the Spark Thrift Server, any cached
> dataframes/RDDs will be available to all clients of Spark via the Thrift
> Server until it is shutdown.
>
> If you want to support key value lookups you might want to use IndexedRDD
> 
>
> Finally not the same as sharing RDDs, Tachyon can cache underlying HDFS
> blocks.
>
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
>
>
> On 6 November 2015 at 05:56, r7raul1...@163.com 
> wrote:
>
>> You can try
>> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
>>  .
>>   Hive tmp table use this function to speed job.
>> https://issues.apache.org/jira/browse/HIVE-7313
>>
>> --
>> r7raul1...@163.com
>>
>>
>> *From:* Christian 
>> *Date:* 2015-11-06 13:50
>> *To:* Deepak Sharma 
>> *CC:* user 
>> *Subject:* Re: Spark RDD cache persistence
>> I've never had this need and I've never done it. There are options that
>> allow this. For example, I know there are web apps out there that work like
>> the Spark REPL. One of these, I think, is called Zeppelin. I've never used
>> them, but I've seen them demoed. There is also Tachyon, which Spark
>> supports. Hopefully that gives you a place to start.
>> On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma 
>> wrote:
>>
>>> Thanks Christian.
>>> So is there any inbuilt mechanism in spark or api integration  to other
>>> inmemory cache products such as redis to load the RDD to these system upon
>>> program exit ?
>>> What's the best approach to have long lived RDD cache ?
>>> Thanks
>>>
>>>
>>> Deepak
>>> On 6 Nov 2015 8:34 am, "Christian"  wrote:
>>>
 The cache gets cleared out when the job finishes. I am not aware of a
 way to keep the cache around between jobs. You could save it as an object
 file to disk and load it as an object file on your next job for speed.
 On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma 
 wrote:

> Hi All
> I am confused on RDD persistence in cache .
> If I cache RDD , is it going to stay there in memory even if my spark
> program completes execution , which created it.
> If not , how can I guarantee that RDD is persisted in cache even after
> the program finishes execution.
>
> Thanks
>
>
> Deepak
>

>


Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Robin, 
Thanks for your reply and thanks for copying my question to user mailing list.
Yes, we have a distributed C++ application that will store data on each node 
in the cluster, and we hope to leverage Spark to do more fancy analytics on 
that data. But we need high performance; that’s why we want shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia

On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:

> -dev, +user (this is not a question about development of Spark itself so 
> you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m 
> sure it would be possible with enough tinkering but it’s not clear what you 
> are trying to achieve. Spark is a distributed processing system, it has 
> multiple JVMs running on different machines that each run a small part of the 
> overall processing. Unless you have some sort of idea to have multiple C++ 
> processes collocated with the distributed JVMs using named memory mapped 
> files doesn’t make architectural sense. 
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
>> 
>> Dears, for one project, I need to implement something so Spark can read data 
>> from a C++ process. 
>> To provide high performance, I really hope to implement this through shared 
>> memory between the C++ process and Java JVM process.
>> It seems it may be possible to use named memory mapped files and JNI to do 
>> this, but I wonder whether there is any existing efforts or more efficient 
>> approach to do this?
>> Thank you very much!
>> 
>> Best Regards,
>> Jia
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDFs in C++; I'm just wondering whether Spark can read data from 
and write data to a C++ process with zero copy.

Best Regards,
Jia
 


On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote:

> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
> and R.
> 
> The best way to achieve this is to run your application in C++ and used the 
> data created by said application to do manipulation within Spark.
> 
> 
> 
> On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
> 
> 
> Thanks, Dewful!
> 
> My impression is that Tachyon is a very nice in-memory file system that can 
> connect to multiple storages.
> However, because our data is also hold in memory, I suspect that connecting 
> to Spark directly may be more efficient in performance.
> But definitely I need to look at Tachyon more carefully, in case it has a 
> very efficient C++ binding mechanism.
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:
> 
>> Maybe looking into something like Tachyon would help, I see some sample c++ 
>> bindings, not sure how much of the current functionality they support...
>> Hi, Robin, 
>> Thanks for your reply and thanks for copying my question to user mailing 
>> list.
>> Yes, we have a distributed C++ application, that will store data on each 
>> node in the cluster, and we hope to leverage Spark to do more fancy 
>> analytics on those data. But we need high performance, that’s why we want 
>> shared memory.
>> Suggestions will be highly appreciated!
>> 
>> Best Regards,
>> Jia
>> 
>> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:
>> 
>>> -dev, +user (this is not a question about development of Spark itself so 
>>> you’ll get more answers in the user mailing list)
>>> 
>>> First up let me say that I don’t really know how this could be done - I’m 
>>> sure it would be possible with enough tinkering but it’s not clear what you 
>>> are trying to achieve. Spark is a distributed processing system, it has 
>>> multiple JVMs running on different machines that each run a small part of 
>>> the overall processing. Unless you have some sort of idea to have multiple 
>>> C++ processes collocated with the distributed JVMs using named memory 
>>> mapped files doesn’t make architectural sense. 
>>> ---
>>> Robin East
>>> Spark GraphX in Action Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
>>>> 
>>>> Dears, for one project, I need to implement something so Spark can read 
>>>> data from a C++ process. 
>>>> To provide high performance, I really hope to implement this through 
>>>> shared memory between the C++ process and Java JVM process.
>>>> It seems it may be possible to use named memory mapped files and JNI to do 
>>>> this, but I wonder whether there is any existing efforts or more efficient 
>>>> approach to do this?
>>>> Thank you very much!
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>> 
>>> 
>> 
> 
> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Robin, you have a very good point!
We feel that the data copy and allocation overhead may become a performance 
bottleneck, and we are evaluating that right now.
We will do the shared memory work only if we’re sure about the potential 
performance gain and sure that there is nothing existing in the Spark community 
that we can leverage to do this.
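
In case it is useful to anyone following along, the JVM side of such a shared
region can be as simple as a memory-mapped file; this is only a sketch, and the
path and record layout are assumptions (the C++ process would mmap the same file):

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Map the region written by the C++ process; no copy into a Java byte[] is needed
val channel = FileChannel.open(Paths.get("/dev/shm/shared-region"), StandardOpenOption.READ)
val buffer  = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())

val firstValue = buffer.getDouble(0)   // read a field directly from the mapped pages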

Best Regards,
Jia


On Dec 7, 2015, at 11:56 AM, Robin East <robin.e...@xense.co.uk> wrote:

> I guess you could write a custom RDD that can read data from a memory-mapped 
> file - not really my area of expertise so I’ll leave it to other members of 
> the forum to chip in with comments as to whether that makes sense. 
> 
> But if you want ‘fancy analytics’ then won’t the processing time more than 
> out-weigh the savings from using memory mapped files? Particularly if your 
> analytics involve any kind of aggregation of data across data nodes. Have you 
> looked at a Lambda architecture which could involve Spark but doesn’t 
> necessarily mean you would go to the trouble of implementing a custom 
> memory-mapped file reading feature.
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 7 Dec 2015, at 17:32, Jia <jacqueline...@gmail.com> wrote:
>> 
>> Hi, Robin, 
>> Thanks for your reply and thanks for copying my question to user mailing 
>> list.
>> Yes, we have a distributed C++ application, that will store data on each 
>> node in the cluster, and we hope to leverage Spark to do more fancy 
>> analytics on those data. But we need high performance, that’s why we want 
>> shared memory.
>> Suggestions will be highly appreciated!
>> 
>> Best Regards,
>> Jia
>> 
>> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:
>> 
>>> -dev, +user (this is not a question about development of Spark itself so 
>>> you’ll get more answers in the user mailing list)
>>> 
>>> First up let me say that I don’t really know how this could be done - I’m 
>>> sure it would be possible with enough tinkering but it’s not clear what you 
>>> are trying to achieve. Spark is a distributed processing system, it has 
>>> multiple JVMs running on different machines that each run a small part of 
>>> the overall processing. Unless you have some sort of idea to have multiple 
>>> C++ processes collocated with the distributed JVMs using named memory 
>>> mapped files doesn’t make architectural sense. 
>>> -------
>>> Robin East
>>> Spark GraphX in Action Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
>>>> 
>>>> Dears, for one project, I need to implement something so Spark can read 
>>>> data from a C++ process. 
>>>> To provide high performance, I really hope to implement this through 
>>>> shared memory between the C++ process and Java JVM process.
>>>> It seems it may be possible to use named memory mapped files and JNI to do 
>>>> this, but I wonder whether there is any existing efforts or more efficient 
>>>> approach to do this?
>>>> Thank you very much!
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>> 
>>> 
>> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Kazuaki,

It’s very similar to my requirement, thanks!
It seems they want to write to a C++ process with zero copy, and I want to do 
both read and write with zero copy.
Does anyone know how to obtain more information, like the current status of this 
JIRA entry?

Best Regards,
Jia




On Dec 7, 2015, at 12:26 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:

> Is this JIRA entry related to what you want?
> https://issues.apache.org/jira/browse/SPARK-10399
> 
> Regards,
> Kazuaki Ishizaki
> 
> 
> 
> From:Jia <jacqueline...@gmail.com>
> To:Dewful <dew...@gmail.com>
> Cc:"user @spark" <user@spark.apache.org>, d...@spark.apache.org, 
> Robin East <robin.e...@xense.co.uk>
> Date:2015/12/08 03:17
> Subject:Re: Shared memory between C++ process and Spark
> 
> 
> 
> Thanks, Dewful!
> 
> My impression is that Tachyon is a very nice in-memory file system that can 
> connect to multiple storages.
> However, because our data is also hold in memory, I suspect that connecting 
> to Spark directly may be more efficient in performance.
> But definitely I need to look at Tachyon more carefully, in case it has a 
> very efficient C++ binding mechanism.
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:
> Maybe looking into something like Tachyon would help, I see some sample c++ 
> bindings, not sure how much of the current functionality they support...
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node 
> in the cluster, and we hope to leverage Spark to do more fancy analytics on 
> those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:
> 
> -dev, +user (this is not a question about development of Spark itself so 
> you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m 
> sure it would be possible with enough tinkering but it’s not clear what you 
> are trying to achieve. Spark is a distributed processing system, it has 
> multiple JVMs running on different machines that each run a small part of the 
> overall processing. Unless you have some sort of idea to have multiple C++ 
> processes collocated with the distributed JVMs using named memory mapped 
> files doesn’t make architectural sense. 
> -------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
> 
> Dears, for one project, I need to implement something so Spark can read data 
> from a C++ process. 
> To provide high performance, I really hope to implement this through shared 
> memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to do 
> this, but I wonder whether there is any existing efforts or more efficient 
> approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 
> 
> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Dewful!

My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storages.
However, because our data is also held in memory, I suspect that connecting to 
Spark directly may be more efficient.
But I definitely need to look at Tachyon more carefully, in case it has a very 
efficient C++ binding mechanism.

Best Regards,
Jia

On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:

> Maybe looking into something like Tachyon would help, I see some sample c++ 
> bindings, not sure how much of the current functionality they support...
> 
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node 
> in the cluster, and we hope to leverage Spark to do more fancy analytics on 
> those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:
> 
>> -dev, +user (this is not a question about development of Spark itself so 
>> you’ll get more answers in the user mailing list)
>> 
>> First up let me say that I don’t really know how this could be done - I’m 
>> sure it would be possible with enough tinkering but it’s not clear what you 
>> are trying to achieve. Spark is a distributed processing system, it has 
>> multiple JVMs running on different machines that each run a small part of 
>> the overall processing. Unless you have some sort of idea to have multiple 
>> C++ processes collocated with the distributed JVMs using named memory mapped 
>> files doesn’t make architectural sense. 
>> ---
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>> 
>> 
>> 
>> 
>> 
>>> On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
>>> 
>>> Dears, for one project, I need to implement something so Spark can read 
>>> data from a C++ process. 
>>> To provide high performance, I really hope to implement this through shared 
>>> memory between the C++ process and Java JVM process.
>>> It seems it may be possible to use named memory mapped files and JNI to do 
>>> this, but I wonder whether there is any existing efforts or more efficient 
>>> approach to do this?
>>> Thank you very much!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>> 
>> 
> 



Re: Spark 1.5.1 Build Failure

2015-10-30 Thread Jia Zhan
Hi,

Have you tried building it without Hadoop?

$ build/mvn -DskipTests clean package

Can you check whether build/mvn started successfully, or whether it's using your
own mvn? Let us know your jdk version as well.

On Thu, Oct 29, 2015 at 11:34 PM, Raghuveer Chanda <
raghuveer.cha...@gmail.com> wrote:

> Hi,
>
> I am trying to build spark 1.5.1 for hadoop 2.5 but I get the following
> error.
>
>
> *build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests
> clean package*
>
>
> [INFO] Spark Project Parent POM ... SUCCESS [
>  9.812 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 27.701 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 16.721 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  8.617 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 27.124 s]
> [INFO] Spark Project Core . FAILURE [09:08
> min]
>
> Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-core_2.10: Execution
> scala-test-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
> CompileFailed -> [Help 1]
>
>
>
> --
> Regards,
> Raghuveer Chanda
>
>


-- 
Jia Zhan


Re: How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Calvin Jia
Hi Shane,

Tachyon provides an API to get the block locations of a file, which Spark
uses when scheduling tasks.

Hope this helps,
Calvin
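
Conceptually (a sketch, not Tachyon-specific code): Spark's HadoopRDD asks the
underlying Hadoop-compatible FileSystem for block locations and uses the returned
hosts as the tasks' preferred locations, and Tachyon's Hadoop client exposes the
same information. The URI below is illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("tachyon://master:19998/data/file")
val fs = path.getFileSystem(new Configuration())
val status = fs.getFileStatus(path)

// The hosts reported here are what end up as a task's preferred locations
val locations = fs.getFileBlockLocations(status, 0, status.getLen)
locations.foreach(loc => println(loc.getHosts.mkString(", ")))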

On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane 
wrote:

> Hi all,
>
>
>
> I am looking into how Spark handles data locality wrt Tachyon. My main
> concern is how this is coordinated. Will it send a task based on a file
> loaded from Tachyon to a node that it knows has that file locally and how
> does it know which nodes has what?
>
>
>
> Kind regards,
>
> Shane
> This email (including any attachments) is proprietary to Aspect Software,
> Inc. and may contain information that is confidential. If you have received
> this message in error, please do not read, copy or forward this message.
> Please notify the sender immediately, delete it from your system and
> destroy any copies. You may not further disclose or distribute this email
> or its attachments.
>


Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Igor,

It iteratively conducts reduce((a,b)*=>*a+b), which is the action there. I can
clearly see 4 stages (one saveAsTextFile() and three reduce()) in the web
UI. I don't know what's going on there that causes the non-intuitive caching
behavior.

Thanks for help!

On Sun, Oct 18, 2015 at 11:32 PM, Igor Berman <igor.ber...@gmail.com> wrote:

> Does ur iterations really submit job? I dont see any action there
> On Oct 17, 2015 00:03, "Jia Zhan" <zhanjia...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am running Spark locally in one node and trying to sweep the memory
>> size for performance tuning. The machine has 8 CPUs and 16G main memory,
>> the dataset in my local disk is about 10GB. I have several quick questions
>> and appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without using RDD.cache(),
>> will anything be cached in memory at all? My guess is that, without
>> RDD.cache(), only a small amount of data will be stored in OS buffer cache,
>> and every iteration of computation will still need to fetch most data from
>> disk every time, is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote a
>> simple program as shown below, which basically consists of one saveAsText()
>> and three reduce() actions/stages. I specify "spark.driver.memory" to
>> "15g", others by default. Then I run three experiments.
>>
>> *   val* *conf* = *new* *SparkConf*().setAppName(*"wordCount"*)
>>
>>*val* *sc* = *new* *SparkContext*(conf)
>>
>>*val* *input* = sc.textFile(*"/InputFiles"*)
>>
>>   *val* *words* = input.flatMap(line *=>* line.split(*" "*)).map(word
>> *=>* (word, *1*)).reduceByKey(_+_).saveAsTextFile(*"/OutputFiles"*)
>>
>>   *val* *ITERATIONS* = *3*
>>
>>   *for* (i *<-* *1* to *ITERATIONS*) {
>>
>>   *val* *totallength* = input.filter(line*=>*line.contains(
>> *"the"*)).map(s*=>*s.length).reduce((a,b)*=>*a+b)
>>
>>   }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min+3.3min+3.2min+3.3min)
>>
>> (II) The second run, I modified the code so that the input will be
>> cached:
>>  *val input = sc.textFile("/InputFiles").cache()*
>>  The application finishes in ~11 mins!! (5.4min+1.9min+1.9min+2.0min)!
>>  The storage page in Web UI shows 48% of the dataset  is cached,
>> which makes sense due to large java object overhead, and
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) However, the third run, same program as the second one, but I
>> changed "spark.driver.memory" to be "2g".
>>The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
>> And UI shows 6% of the data is cached.
>>
>> *From the results we can see the reduce stages finish in seconds, how
>> could that happen with only 6% cached? Can anyone explain?*
>>
>> I am new to Spark and would appreciate any help on this. Thanks!
>>
>> Jia
>>
>>
>>
>>


-- 
Jia Zhan


Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Sonal,

I tried changing spark.executor.memory but nothing changes. It seems that
when I run locally on one machine, the RDD is cached in driver memory
instead of executor memory. Here is a related post online:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-td22279.html

When I change spark.driver.memory, I can see the change of cached data in
 web UI. Like I mentioned, when I set driver memory to 2G, it says 6% RDD
cached. When set to 15G, it says 48% RDD cached, but with much slower
speed!
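
One way to see where the blocks actually live from inside the application (in
local mode there is a single block manager, the driver JVM itself):

// Prints each block manager with its max and remaining storage memory
sc.getExecutorMemoryStatus.foreach { case (location, (maxMem, remainingMem)) =>
  println(s"$location: max=${maxMem / (1024 * 1024)}MB remaining=${remainingMem / (1024 * 1024)}MB")
}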

On Sun, Oct 18, 2015 at 10:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Hi Jia,
>
> RDDs are cached on the executor, not on the driver. I am assuming you are
> running locally and haven't changed spark.executor.memory?
>
> Sonal
> On Oct 19, 2015 1:58 AM, "Jia Zhan" <zhanjia...@gmail.com> wrote:
>
> Does anyone have any clue what's going on? Why would caching with 2g memory
> be much faster than with 15g memory?
>
> Thanks very much!
>
> On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am running Spark locally in one node and trying to sweep the memory
>> size for performance tuning. The machine has 8 CPUs and 16G main memory,
>> the dataset in my local disk is about 10GB. I have several quick questions
>> and appreciate any comments.
>>
>> 1. Spark performs in-memory computing, but without using RDD.cache(),
>> will anything be cached in memory at all? My guess is that, without
>> RDD.cache(), only a small amount of data will be stored in OS buffer cache,
>> and every iteration of computation will still need to fetch most data from
>> disk every time, is that right?
>>
>> 2. To evaluate how caching helps with iterative computation, I wrote a
>> simple program as shown below, which basically consists of one saveAsText()
>> and three reduce() actions/stages. I specify "spark.driver.memory" to
>> "15g", others by default. Then I run three experiments.
>>
>> *   val* *conf* = *new* *SparkConf*().setAppName(*"wordCount"*)
>>
>>*val* *sc* = *new* *SparkContext*(conf)
>>
>>*val* *input* = sc.textFile(*"/InputFiles"*)
>>
>>   *val* *words* = input.flatMap(line *=>* line.split(*" "*)).map(word
>> *=>* (word, *1*)).reduceByKey(_+_).saveAsTextFile(*"/OutputFiles"*)
>>
>>   *val* *ITERATIONS* = *3*
>>
>>   *for* (i *<-* *1* to *ITERATIONS*) {
>>
>>   *val* *totallength* = input.filter(line*=>*line.contains(
>> *"the"*)).map(s*=>*s.length).reduce((a,b)*=>*a+b)
>>
>>   }
>>
>> (I) The first run: no caching at all. The application finishes in ~12
>> minutes (2.6min+3.3min+3.2min+3.3min)
>>
>> (II) The second run, I modified the code so that the input will be
>> cached:
>>  *val input = sc.textFile("/InputFiles").cache()*
>>  The application finishes in ~11 mins!! (5.4min+1.9min+1.9min+2.0min)!
>>  The storage page in Web UI shows 48% of the dataset  is cached,
>> which makes sense due to large java object overhead, and
>> spark.storage.memoryFraction is 0.6 by default.
>>
>> (III) However, the third run, same program as the second one, but I
>> changed "spark.driver.memory" to be "2g".
>>The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
>> And UI shows 6% of the data is cached.
>>
>> *From the results we can see the reduce stages finish in seconds, how
>> could that happen with only 6% cached? Can anyone explain?*
>>
>> I am new to Spark and would appreciate any help on this. Thanks!
>>
>> Jia
>>
>>
>>
>>
>
>
> --
> Jia Zhan
>
>


-- 
Jia Zhan


Re: In-memory computing and cache() in Spark

2015-10-18 Thread Jia Zhan
Does anyone have any clue what's going on? Why would caching with 2g memory be
much faster than with 15g memory?

Thanks very much!

On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:

> Hi all,
>
> I am running Spark locally in one node and trying to sweep the memory size
> for performance tuning. The machine has 8 CPUs and 16G main memory, the
> dataset in my local disk is about 10GB. I have several quick questions and
> appreciate any comments.
>
> 1. Spark performs in-memory computing, but without using RDD.cache(), will
> anything be cached in memory at all? My guess is that, without RDD.cache(),
> only a small amount of data will be stored in OS buffer cache, and every
> iteration of computation will still need to fetch most data from disk every
> time, is that right?
>
> 2. To evaluate how caching helps with iterative computation, I wrote a
> simple program as shown below, which basically consists of one saveAsText()
> and three reduce() actions/stages. I specify "spark.driver.memory" to
> "15g", others by default. Then I run three experiments.
>
> *   val* *conf* = *new* *SparkConf*().setAppName(*"wordCount"*)
>
>*val* *sc* = *new* *SparkContext*(conf)
>
>*val* *input* = sc.textFile(*"/InputFiles"*)
>
>   *val* *words* = input.flatMap(line *=>* line.split(*" "*)).map(word
> *=>* (word, *1*)).reduceByKey(_+_).saveAsTextFile(*"/OutputFiles"*)
>
>   *val* *ITERATIONS* = *3*
>
>   *for* (i *<-* *1* to *ITERATIONS*) {
>
>   *val* *totallength* = input.filter(line*=>*line.contains(*"the"*
> )).map(s*=>*s.length).reduce((a,b)*=>*a+b)
>
>   }
>
> (I) The first run: no caching at all. The application finishes in ~12
> minutes (2.6min+3.3min+3.2min+3.3min)
>
> (II) The second run, I modified the code so that the input will be cached:
>  *val input = sc.textFile("/InputFiles").cache()*
>  The application finishes in ~11 mins!! (5.4min+1.9min+1.9min+2.0min)!
>  The storage page in Web UI shows 48% of the dataset  is cached, which
> makes sense due to large java object overhead, and
> spark.storage.memoryFraction is 0.6 by default.
>
> (III) However, the third run, same program as the second one, but I
> changed "spark.driver.memory" to be "2g".
>The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
> And UI shows 6% of the data is cached.
>
> *From the results we can see the reduce stages finish in seconds, how
> could that happen with only 6% cached? Can anyone explain?*
>
> I am new to Spark and would appreciate any help on this. Thanks!
>
> Jia
>
>
>
>


-- 
Jia Zhan


In-memory computing and cache() in Spark

2015-10-16 Thread Jia Zhan
Hi all,

I am running Spark locally in one node and trying to sweep the memory size
for performance tuning. The machine has 8 CPUs and 16G main memory, the
dataset in my local disk is about 10GB. I have several quick questions and
appreciate any comments.

1. Spark performs in-memory computing, but without using RDD.cache(), will
anything be cached in memory at all? My guess is that, without RDD.cache(),
only a small amount of data will be stored in OS buffer cache, and every
iteration of computation will still need to fetch most data from disk every
time, is that right?

2. To evaluate how caching helps with iterative computation, I wrote a
simple program as shown below, which basically consists of one saveAsText()
and three reduce() actions/stages. I specify "spark.driver.memory" to
"15g", others by default. Then I run three experiments.

*   val* *conf* = *new* *SparkConf*().setAppName(*"wordCount"*)

   *val* *sc* = *new* *SparkContext*(conf)

   *val* *input* = sc.textFile(*"/InputFiles"*)

  *val* *words* = input.flatMap(line *=>* line.split(*" "*)).map(word
*=>* (word, *1*)).reduceByKey(_+_).saveAsTextFile(*"/OutputFiles"*)

  *val* *ITERATIONS* = *3*

  *for* (i *<-* *1* to *ITERATIONS*) {

  *val* *totallength* = input.filter(line*=>*line.contains(*"the"*
)).map(s*=>*s.length).reduce((a,b)*=>*a+b)

  }

(I) The first run: no caching at all. The application finishes in ~12
minutes (2.6min+3.3min+3.2min+3.3min)

(II) The second run, I modified the code so that the input will be cached:
 *val input = sc.textFile("/InputFiles").cache()*
 The application finishes in ~11 mins!! (5.4min+1.9min+1.9min+2.0min)!
 The storage page in Web UI shows 48% of the dataset  is cached, which
makes sense due to large java object overhead, and
spark.storage.memoryFraction is 0.6 by default.

(III) However, the third run, same program as the second one, but I changed
"spark.driver.memory" to be "2g".
   The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
And UI shows 6% of the data is cached.

*From the results we can see the reduce stages finish in seconds, how could
that happen with only 6% cached? Can anyone explain?*

I am new to Spark and would appreciate any help on this. Thanks!

Jia


Re: TTL for saveAsObjectFile()

2015-10-14 Thread Calvin Jia
Hi Antonio,

I don't think Spark provides a way to pass down params with
saveAsObjectFile. One way could be to pass a default TTL in the
configuration, but the approach doesn't make much sense since TTL is not
necessarily uniform.

Baidu will be talking about their use of TTL in Tachyon with Spark in this
meetup, which may be
helpful for understanding different ways to integrate.

Hope this helps,
Calvin

On Tue, Oct 13, 2015 at 1:07 PM, antoniosi  wrote:

> Hi,
>
> I am using RDD.saveAsObjectFile() to save the RDD dataset to Tachyon. In
> version 0.8, Tachyon will support for TTL for saved file. Is that supported
> from Spark as well? Is there a way I could specify an TTL for a saved
> object
> file?
>
> Thanks.
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/TTL-for-saveAsObjectFile-tp25051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Can we gracefully kill stragglers in Spark SQL

2015-09-04 Thread Jia Zhan
Hello all,

I am new to Spark and have been working on a small project trying to tackle
the straggler problems. I ran some SQL queries (GROUPBY) on a small cluster
and observed that some tasks take several minutes while others finish in
seconds.

I know that Spark already has speculation mode but I still see this problem
with speculative mode turned on. Therefore, I modified the code to kill
those stragglers instead of re-executing them, trading accuracy for speed.
As expected, killing stragglers causes the system to hang due to the lost
tasks. Can anyone give some guidance on getting this to work? Is it
possible to terminate some tasks early without affecting the overall
execution of the job, at some cost in accuracy?
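
For reference, these are the speculation knobs I was referring to (values here
are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100")     // how often to check for stragglers, in ms
  .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before speculating
  .set("spark.speculation.multiplier", "1.5")   // how much slower than the median a task must be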

Appreciate your help!

-- 
Jia Zhan


Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi,

Tachyon (http://tachyon-project.org) manages memory off heap, which can help
prevent long GC pauses. Also, using Tachyon will allow the data to be
shared between Spark jobs if they use the same dataset.

Here's a production use case (http://www.meetup.com/Tachyon/events/222485713/)
where Baidu runs Tachyon to get a 30x performance improvement in their
SparkSQL workload.

Hope this helps,
Calvin

On Fri, Aug 7, 2015 at 9:42 AM, Muler mulugeta.abe...@gmail.com wrote:

 Spark is an in-memory engine and attempts to do computation in-memory.
 Tachyon is memory-centric distributed storage, OK, but how would that help
 run Spark faster?



Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-16 Thread Jia Yu
Hi Peng,

I got exactly same error! My shuffle data is also very large. Have you
figured out a method to solve that?

Thanks,
Jia

On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng pc...@uow.edu.au wrote:

 I'm deploying a Spark data processing job on an EC2 cluster, the job is
 small
 for the cluster (16 cores with 120G RAM in total), the largest RDD has only
 76k+ rows. But heavily skewed in the middle (thus requires repartitioning)
 and each row has around 100k of data after serialization. The job always
 got
 stuck in repartitioning. Namely, the job will constantly get following
 errors and retries:

 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle

 org.apache.spark.shuffle.FetchFailedException: Error in opening
 FileSegmentManagedBuffer

 org.apache.spark.shuffle.FetchFailedException:
 java.io.FileNotFoundException: /tmp/spark-...
 I've tried to identify the problem but it seems like both memory and disk
 consumption of the machine throwing these errors are below 50%. I've also
 tried different configurations, including:

 let driver/executor memory use 60% of total memory.
 let netty to priortize JVM shuffling buffer.
 increase shuffling streaming buffer to 128m.
 use KryoSerializer and max out all buffers
 increase shuffling memoryFraction to 0.4
 But none of them works. The small job always trigger the same series of
 errors and max out retries (upt to 1000 times). How to troubleshoot this
 thing in such situation?

 Thanks a lot if you have any clue.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/What-are-the-likely-causes-of-org-apache-spark-shuffle-MetadataFetchFailedException-Missing-an-outpu-tp22646.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Help!!!Map or join one large datasets then suddenly remote Akka client disassociated

2015-06-15 Thread Jia Yu
Hi folks,

Help me! I met a very weird problem. I really need some help!! Here is my
situation:

Case: Assign keys to two datasets (one is 96GB with 2.7 billion records and
one 1.5GB with 30k records) via MapPartitions first, and join them together
with their keys.

Environment:

Standalone Spark on Amazon EC2
Master*1 13GB 8 cores
Worker*16  each one 13GB 8 cores


(After met this problem, I switched to
Worker*16  each one 59GB 8 cores)


Read and write on HDFS (same cluster)
--
Problem:

At the beginning:---

The MapPartitions looks no problem. But when Spark does the Join for two
datasets, the console says

*ERROR TaskSchedulerImpl: Lost executor 4 on
ip-172-31-27-174.us-west-2.compute.internal: remote Akka client
disassociated*

Then I go back to this worker and check its log

There is something like Master said remote Akka client disassociated and
asked to kill executor *** and then the worker killed this executor

(Sorry I deleted that log and just remember the content.)

There is no other errors before the Akka client disassociated (for both of
master and worker).

Then ---

I tried one 62GB dataset with the 1.5 GB dataset. My job worked
smoothly. *HOWEVER,
I found one thing: If I set the spark.shuffle.memoryFraction to Zero, same
error will happen on this 62GB dataset.*

Then ---

I switched my workers to Worker*16  each one 59GB 8 cores. Error even
happened when Spark does the MapPartitions

Some metrics I
found

*When I do the MapPartitions or Join with 96GB data, its shuffle write is
around 100GB. And I cached 96GB data and its size is around 530GB.*

*Garbage collection time for 96GB dataset when Spark does the Map or Join
is around 12 second.*

My analysis--

This problem might be caused by the large shuffle write. The large shuffle
write causes high I/O on disk. If the shuffle write cannot be completed within
some timeout period, then the master will think this executor is disassociated.

But I don't know how to solve this problem.
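
The only thing I can think of trying (a sketch only; the values are illustrative)
is raising the network timeout and the shuffle fetch retry settings, so a slow
shuffle is retried rather than the executor being declared lost:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "300")        // umbrella network timeout, in seconds
  .set("spark.shuffle.io.maxRetries", "10")   // netty shuffle fetch retries
  .set("spark.shuffle.io.retryWait", "30")    // seconds to wait between retries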

---


Any help will be appreciated!!!

Thanks,
Jia


Re: Spark Cluster Benchmarking Frameworks

2015-06-03 Thread Zhen Jia
Hi Jonathan,
Maybe you can try BigDataBench (http://prof.ict.ac.cn/BigDataBench/). It
provides lots of workloads, including both Hadoop and Spark based workloads.

Zhen Jia

hodgesz wrote
 Hi Spark Experts,
 
 I am curious what people are using to benchmark their Spark clusters.  We
 are about to start a build (bare metal) vs buy (AWS/Google Cloud/Qubole)
 project to determine our Hadoop and Spark deployment selection.  On the
 Hadoop side we will test live workloads as well as simulated ones with
 frameworks like TestDFSIO, TeraSort, MRBench, GridMix, etc.
 
 Do any equivalent benchmarking frameworks exist for Spark?  A quick Google
 search yielded https://github.com/databricks/spark-perf which looks pretty
 interesting.  It would be great to hear what others are doing here.
 
 Thanks for the help!
 
 Jonathan





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cluster-Benchmarking-Frameworks-tp12699p23146.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL 1.3.1 saveAsParquetFile will output tachyon file with different block size

2015-04-28 Thread Calvin Jia
Hi,

You can apply this patch (https://github.com/apache/spark/pull/5354) and
recompile.

Hope this helps,
Calvin

On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa eng.sara.must...@gmail.com
wrote:

 Hi Zhang,

 How did you compile Spark 1.3.1 with Tachyon? when i changed Tachyon
 version
 to 0.6.3 in core/pom.xml, make-distribution.sh and try to compile again,
 many compilation errors raised.

 Thanks,




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-1-3-1-saveAsParquetFile-will-output-tachyon-file-with-different-block-size-tp11561p11870.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Cannot change the memory of workers

2015-04-07 Thread Jia Yu
Hi guys,

Currently I am running a Spark program on Amazon EC2. Each worker has around
(less than, but close to) 2 GB of memory.

By default, I can see each worker is allocated 976 MB of memory, as the table
below from the Spark web UI shows. I know this value comes from (total memory
minus 1 GB). But I want more than 1 GB on each of my workers.

Address    State    Cores        Memory
           ALIVE    1 (0 Used)   976.0 MB (0.0 B Used)

Based on the instructions on the Spark website, I set export SPARK_WORKER_MEMORY=1g
in spark-env.sh. But it doesn't work. BTW, I can set SPARK_EXECUTOR_MEMORY=1g and
it works.

Can anyone help me? Is there a requirement that one worker must maintain 1
gb memory for itself aside from the memory for Spark?

Thanks,
Jia


Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-16 Thread Chang-Jia Wang
I just used random numbers. (My ML lib was spark-mllib_2.10-1.2.1.) Please see the attached log.

In the middle of the log, I dumped the data set before feeding it into LogisticRegressionWithLBFGS. The first column false/true was the label (attribute “a”), and columns 2-5 (attributes “x”, “y”, “z”, and “i”) were the features. The 6th column was just a row ID and was not used.

The relationship was, arbitrarily: a = (0.3 * x + 0.5 * y - 0.2 * z > 0.4)

After that you can find LBFGS doing its job and then pumping out the error messages.

The model showed coefficients:

396.57624765427323, x
662.7969020937115, y
-259.0975519038385, z
12.352037503257826, i
-538.8516249699426, @a

The last one was the intercept. As you can see, the model seemed close enough.

After that I fed the same data back to the model to see how the predictions worked. (Here attribute “a” was the prediction and “aa” was the original label.) I only displayed 20 rows.

The error rate showed 2 errors out of 1000:

count(INTEGER), errorRate(DOUBLE), countDiff(INTEGER)
key=[], rows=1
1000, 0.002000949949026, 2

So, the algorithm worked; spitting out the errors is just kind of annoying. If this is not result-affecting, maybe it should be WARNING or INFO.

C.J.

bleep.log
Description: Binary data
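
For anyone trying to reproduce this, a minimal sketch of the training call on
data shaped like the dump above (the variable names are made up):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// rows: RDD of (a: Boolean, x: Double, y: Double, z: Double, i: Double)
val training = rows.map { case (a, x, y, z, i) =>
  LabeledPoint(if (a) 1.0 else 0.0, Vectors.dense(x, y, z, i))
}

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

println(model.weights)     // coefficients for x, y, z, i
println(model.intercept)
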
On Mar 15, 2015, at 12:42 AM, DB Tsai dbt...@dbtsai.com wrote:

> In LBFGS version of logistic regression, the data is properly
> standardized, so this should not happen. Can you provide a copy of
> your dataset to us so we can test it? If the dataset can not be
> public, can you have just send me a copy so I can dig into this? I'm
> the author of LORWithLBFGS. Thanks.
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
> On Fri, Mar 13, 2015 at 2:41 PM, cjwang c...@cjwang.us wrote:
>> I am running LogisticRegressionWithLBFGS. I got these lines on my console:
>>
>> 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5
>> 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.25
>> 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.125
>> 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.0625
>> 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.03125
>> 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.015625
>> 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.0078125
>> 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.005859375
>> 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line search t: NaN fval: NaN rhs: NaN cdd: NaN
>> 2015-03-12 17:38:05,257 ERROR breeze.optimize.LBFGS | Failure! Resetting history: breeze.optimize.FirstOrderException: Line search zoom failed
>>
>> What causes them and how do I fix them?
>>
>> I checked my data and there seemed nothing out of the ordinary. The
>> resulting prediction model seemed acceptable to me. So, are these ERRORs
>> actually WARNINGs? Could we or should we tune the level of these messages
>> down one notch?
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LogisticRegressionWithLBFGS-shows-ERRORs-tp22042.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org