Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-12 Thread Haoyuan Li
This link should be helpful:
https://alluxio.org/docs/1.7/en/Running-Spark-on-Alluxio.html

Best regards,

Haoyuan (HY)

alluxio.com | alluxio.org | powered by Alluxio


On Thu, Apr 12, 2018 at 6:32 PM, jb44  wrote:

> I'm running Spark in LOCAL mode and trying to get it to talk to Alluxio. I'm
> getting the error: java.lang.ClassNotFoundException: Class
> alluxio.hadoop.FileSystem not found
> The apparent cause of this error is that Spark cannot find the Alluxio
> client jar on its classpath.
>
> I have looked at the page here:
> https://www.alluxio.org/docs/master/en/Debugging-Guide.html#q-why-do-i-see-exceptions-like-javalangruntimeexception-javalangclassnotfoundexception-class-alluxiohadoopfilesystem-not-found
> which details the steps to take in this situation, but I'm not finding
> success.
>
> According to the Spark documentation, I can instantiate a local Spark session like so:
>
> SparkSession.builder
>   .appName("App")
>   .getOrCreate
>
> Then I can add the Alluxio client library like so:
> sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
> sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
>
> I have verified that the proper jar file exists in the right location on my
> local machine with:
> logger.error(sparkSession.conf.get("spark.driver.extraClassPath"))
> logger.error(sparkSession.conf.get("spark.executor.extraClassPath"))
>
> But I still get the error. Is there anything else I can do to figure out why
> Spark is not picking the library up?
>
> Please note I am not using spark-submit - I am aware of the methods for
> adding the client jar to a spark-submit job. My Spark instance is being
> created as local within my application and this is the use case I want to
> solve.
>
> As an FYI, there is another application in the cluster which is connecting
> to my Alluxio using the fs client, and that all works fine. In that case,
> though, the fs client is packaged as part of the application through
> standard sbt dependencies.
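
A note on the approach above: extraClassPath settings are read when a JVM
launches, so calling conf.set() on an already-running SparkSession cannot add
a jar to the driver. In local mode the driver and executors live inside the
application's own JVM, so the surest fix is to have the Alluxio client on the
application classpath at startup (for example as an sbt dependency, as in the
working application mentioned above). A minimal sketch, assuming the client
jar is on the classpath; the fs.alluxio.impl mapping simply points the
alluxio:// scheme at the class named in the error, and is not a verified fix:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("App")
      .master("local[*]")
      .getOrCreate()

    // Map the alluxio:// scheme to the client class from the error message.
    // This only helps if alluxio.hadoop.FileSystem is already on the classpath.
    spark.sparkContext.hadoopConfiguration
      .set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")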


Re: Writing files to s3 without temporary directory

2017-11-22 Thread Haoyuan Li
This blog / tutorial may be helpful for running Spark in the cloud with
Alluxio.

Best regards,

Haoyuan

On Mon, Nov 20, 2017 at 2:12 PM, lucas.g...@gmail.com wrote:

> That sounds like a lot of work, and if I understand you correctly it sounds
> like a piece of middleware that already exists (I could be wrong). Alluxio?
>
> Good luck and let us know how it goes!
>
> Gary
>
> On 20 November 2017 at 14:10, Jim Carroll  wrote:
>
>> Thanks. In the meantime I might just write a custom file system that maps
>> writes of parquet file parts to their final locations and then skips the
>> move.
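
A lighter mitigation that some users try before writing a custom file system
(a hedged sketch, not something confirmed in this thread) is the version 2
FileOutputCommitter algorithm, which moves task output directly to the
destination instead of staging it through a second _temporary rename:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("S3Write").getOrCreate()

    // Commit task output directly to the final location (fewer S3 renames,
    // at the cost of weaker job-commit atomicity if a job fails mid-commit).
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")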


Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread Haoyuan Li
Yes. Tachyon can handle this well: http://tachyon-project.org/
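
A minimal sketch of the pattern, assuming a Tachyon master at
tachyon://master:19998 and a serializable Session class (both are
placeholders): the first job writes the session RDD out to Tachyon, and the
job an hour later reads it back instead of recomputing it.

    // Job 1: write the active-session RDD to Tachyon so it outlives the job.
    sessions.saveAsObjectFile("tachyon://master:19998/checkpoints/active-sessions")

    // Job 2 (an hour later): read it back and join with the current batch.
    val previous = sc.objectFile[Session]("tachyon://master:19998/checkpoints/active-sessions")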

Best,

Haoyuan

On Wed, Jul 22, 2015 at 10:56 AM, swetha swethakasire...@gmail.com wrote:

 Hi,

 We have a requirement wherein we need to keep RDDs in memory between Spark
 batch jobs that run every hour. The idea here is to keep RDDs holding
 active user sessions in memory between two jobs, so that once one job
 finishes and another runs an hour later, the RDDs with active sessions are
 still available for joining with those in the current job. So, what do we
 need to do to keep the data in memory between two batch jobs? Can we use
 Tachyon?

 Thanks,
 Swetha







-- 
Haoyuan Li
CEO, Tachyon Nexus http://www.tachyonnexus.com/


Re: How to stop making Multiple copies in memory when running multiple Spark jobs?

2015-07-05 Thread Haoyuan Li
You can also find more info here:
http://tachyon-project.org/master/Running-Spark-on-Tachyon.html
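
As a small illustration of the idea (a sketch, assuming a Tachyon cluster is
already running alongside Spark): persisting with StorageLevel.OFF_HEAP keeps
the cached blocks in Tachyon rather than in each executor's JVM heap, so one
copy of the data can be shared instead of being duplicated per job.

    import org.apache.spark.storage.StorageLevel

    // One copy of the cached blocks lives in Tachyon, outside the JVM heaps.
    val cached = sc.textFile("hdfs:///data/events").persist(StorageLevel.OFF_HEAP)
    cached.count()  // materialize the cache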

Hope this helps.

Haoyuan

On Tue, Jun 30, 2015 at 11:28 PM, Himanshu Mehra himanshumehra@gmail.com wrote:

 Hi neprasad,

 You should give the Tachyon system a try, or any other in-memory db. Here
 is the link to a slideshow explaining all this:
 http://www.slideshare.net/haoyuanli/tachyon20141121ampcamp5-41881671?related=1
 What you want is not just the 10th slide; the whole deck is worth reading.

 Thank you.

 Himanshu







Fast big data analytics with Spark on Tachyon in Baidu

2015-05-12 Thread Haoyuan Li
Dear all,

We’re organizing a meetup http://www.meetup.com/Tachyon/events/222485713/ on
May 28th at IBM in Foster City that might be of interest to the Spark
community. The focus is a production use case of Spark and Tachyon at Baidu.

You can sign up here: http://www.meetup.com/Tachyon/events/222485713/

Hope some of you can make it!

Best,

Haoyuan


Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-04-01 Thread Haoyuan Li
Response inline.

On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun sean.bigdata...@gmail.com wrote:

 (resending...)

 I was thinking of the same setup… but the more I think about this problem,
 the more interesting it becomes.

 If we allocate 50% of total memory to Tachyon statically, then the Mesos
 benefit of dynamically scheduling resources goes away altogether.


People can still benefit from Mesos dynamically scheduling the remaining
memory, as well as compute resources.



 Can Tachyon be resource-managed by Mesos (dynamically)? Any thoughts or
 comments?



This requires some integration work.

Best,

Haoyuan



 Sean





 Hi Haoyuan,

 So on each mesos slave node I should allocate/section off some amount
 of memory for tachyon (let's say 50% of the total memory) and the rest
 for regular mesos tasks?

 This means, on each slave node I would have the tachyon worker (+ hdfs
 configuration to talk to s3 or the hdfs datanode) and the mesos slave
 process. Is this correct?





 --
 --Sean





-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
Tachyon should be co-located with Spark in this case.

Best,

Haoyuan

On Tue, Mar 31, 2015 at 4:30 PM, Ankur Chauhan achau...@brightcove.com wrote:


 Hi,

 I am fairly new to the Spark ecosystem and I have been trying to set up
 a Spark-on-Mesos deployment. I can't seem to figure out the best
 practices around HDFS and Tachyon. The documentation's section on Spark's
 data locality seems to suggest that each of my Mesos slave nodes
 should also run an HDFS datanode. This seems fine, but I can't seem to
 figure out how I would go about pushing Tachyon into the mix.

 How should I organize my cluster?
 Should Tachyon be colocated on my Mesos worker nodes, or should all
 the Spark jobs reach out to a separate HDFS/Tachyon cluster?

 -- Ankur Chauhan





-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Haoyuan Li
Did you recompile it with Tachyon 0.6.0?

Also, Tachyon 0.6.1 has been released this morning:
http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases
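
For reference, a minimal sketch of the pieces involved (host is a
placeholder): in the Spark 1.2/1.3 era, OFF_HEAP storage talks to the Tachyon
master named by spark.tachyonStore.url, and the Tachyon client compiled into
Spark must match the server version, which is why a recompile is needed.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Point the Tachyon-backed block store at the right master; the Tachyon
    // client built into Spark must match the server's version.
    val conf = new SparkConf()
      .setAppName("OffHeapExample")
      .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")

    val sc = new SparkContext(conf)
    sc.textFile("hdfs:///data/input").persist(StorageLevel.OFF_HEAP)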

Best regards,

Haoyuan

On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote:

 I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
 issue. Here are the logs:
 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to
 create tachyon dir in
 /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver

 Thanks for any other pointers.


 - Ranga



 On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote:

 Thanks for the information. Will rebuild with 0.6.0 till the patch is
 merged.

 On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Take a look at https://github.com/apache/spark/pull/4867

 Cheers

On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com wrote:

 Hi, Ranga

 That's true. It is typically a version-mismatch issue. Note that Spark
 1.2.1 has Tachyon 0.5.0 built in, so I think you need to rebuild Spark
 against your current Tachyon release.
 We have used Tachyon for several of our Spark projects in a production
 environment.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Ranga sra...@gmail.com
 *Date:* 2015-03-18 06:45
 *To:* user@spark.apache.org
 *Subject:* StorageLevel: OFF_HEAP
 Hi

 I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
 cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
 However, when I try to persist the RDD, I get the following error:

 ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
 TachyonFS.java[connect]:364)  - Invalid method name:
 'getUserUnderfsTempFolder'
 ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
 TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

 Is this because of a version mismatch?

 On a different note, I was wondering if Tachyon has been used in a
 production environment by anybody in this group?

 Appreciate your help with this.


 - Ranga







-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Spark or Tachyon: capture data lineage

2015-01-02 Thread Haoyuan Li
Jerry,

Great question. Spark and Tachyon capture lineage information at different
granularities. We are working on an integration between Spark and Tachyon
for this, and hope to get it released soon.
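
At the Spark granularity, the RDD-level lineage is already inspectable from
the API; a small illustration (paths are hypothetical):

    // toDebugString prints the chain of transformations (the RDD lineage)
    // that Spark tracks for recomputation: here, A and B -> union -> filter.
    val a = sc.textFile("hdfs:///data/A")
    val b = sc.textFile("hdfs:///data/B")
    val c = a.union(b).filter(_.nonEmpty)
    println(c.toDebugString)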

Best,

Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark developers,

 I was thinking it would be nice to extract the data lineage information
 from a data processing pipeline. I assume that Spark/Tachyon keeps this
 information somewhere. For instance, a data processing pipeline uses
 datasources A and B to produce C. C is then used by another process to
 produce D and E. Assuming A, B, C, D, and E are stored on disk, it would be
 so useful if there were a way to capture this information when using
 Spark/Tachyon, and to query it. For example: give me the datasets that
 produce E; it should give me a graph like (A and B) -> C -> E.

 Is this something already possible with Spark/Tachyon? If not, do you
 think it is possible? Does anyone mind sharing their experience in
 capturing data lineage in a data processing pipeline?

 Best Regards,

 Jerry




-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Persist kafka streams to text file, tachyon error?

2014-11-22 Thread Haoyuan Li
StorageLevel.OFF_HEAP requires running Tachyon:
http://spark.apache.org/docs/latest/programming-guide.html

If you don't know whether you have Tachyon or not, you probably don't :)
http://tachyon-project.org/

For local testing, you can use other persist() options without running
Tachyon.
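
For example, a minimal local-testing sketch that keeps the stream cached
without any Tachyon dependency (kafkaStream is a placeholder for the DStream
in your code):

    import org.apache.spark.storage.StorageLevel

    // On-heap caching with disk spill; no Tachyon master required.
    kafkaStream.persist(StorageLevel.MEMORY_AND_DISK_SER)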

Best,

Haoyuan

On Fri, Nov 21, 2014 at 2:48 PM, Joanne Contact joannenetw...@gmail.com wrote:

 use the right email list.
 -- Forwarded message --
 From: Joanne Contact joannenetw...@gmail.com
 Date: Fri, Nov 21, 2014 at 2:32 PM
 Subject: Persist kafka streams to text file
 To: u...@spark.incubator.apache.org


 Hello, I am trying to read a Kafka stream into a text file by running Spark
 from my IDE (IntelliJ IDEA). The code is similar to a previous thread on
 persisting a stream to a text file.

 I am new to Spark and Scala. I believe Spark is in local mode, as the
 console shows:
 14/11/21 14:17:11 INFO spark.SparkContext: Spark configuration:
 spark.app.name=local-mode

 I got the following errors. They are related to Tachyon, but I don't know
 whether I have Tachyon or not.

 14/11/21 14:17:54 WARN storage.TachyonBlockManager: Attempt 1 to create
 tachyon dir null failed
 java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998
 after 5 attempts
 at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
 at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
 at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
 at
 org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
 at
 org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at
 org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
 at
 org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
 at
 org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:88)
 at
 org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:82)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:729)
 at
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:594)
 at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: tachyon.org.apache.thrift.TException: Failed to connect to
 master localhost/127.0.0.1:19998 after 5 attempts
 at tachyon.master.MasterClient.connect(MasterClient.java:178)
 at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
 ... 28 more
 Caused by: tachyon.org.apache.thrift.transport.TTransportException:
 java.net.ConnectException: Connection refused
 at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
 at
 tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
 at tachyon.master.MasterClient.connect(MasterClient.java:156)
 ... 29 more
 Caused by: java.net.ConnectException: Connection refused
 at java.net.PlainSocketImpl.socketConnect(Native Method)
 at
 java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
 at
 java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
 at
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
 at java.net.Socket.connect(Socket.java:579)
 at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:180)
 ... 31 more
 14/11/21 14:17:54 ERROR storage.TachyonBlockManager: Failed 10 attempts to
 create tachyon dir in
 /tmp_spark_tachyon/spark-3dbec68b-f5b8-45e1-bb68-370439839d4a/driver

 I looked at the code. It has the following part. Is that a problem?

 .persist(StorageLevel.OFF_HEAP)

 Any advice?

 Thank you!

 J




-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Saving very large data sets as Parquet on S3

2014-10-24 Thread Haoyuan Li
Daniel,

Currently, having Tachyon will at least help on the input side in this case.
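
The idea, sketched (URI and path are placeholders): load the JSON through
Tachyon's namespace instead of straight from S3, so repeated reads of the
input are served from memory rather than going back to S3 each time.

    // Read the source data through Tachyon rather than directly from S3.
    val data = sqlContext.jsonFile("tachyon://tachyon-master:19998/source/path")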

Haoyuan

On Fri, Oct 24, 2014 at 2:01 PM, Daniel Mahler dmah...@gmail.com wrote:

 I am trying to convert some json logs to Parquet and save them on S3.
 In principle this is just

 import org.apache.spark._
 val sqlContext = new sql.SQLContext(sc)
 val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8)
 data.registerAsTable("data")
 data.saveAsParquetFile("s3n://target/path")

 This works fine for up to about 10^9 records, but above that I start
 having problems.
 The first problem I encountered is that after the data files get written
 out, writing the Parquet summary file fails.
 While I seem to have all the data saved out,
 programs have a huge start-up time
 when processing parquet files without a summary file.

 Writing the summary file appears to depend primarily
 on the number of partitions being written,
 and to be relatively independent of the amount of data being written.
 Problems start after about 1000 partitions;
 writing 1 partitions fails even with one repartitioned day's worth of
 data.

 My data is very finely partitioned, about 16 log files per hour, or 13K
 files per month.
 The file sizes are very uneven, ranging over several orders of magnitude.
 There are several years of data.
 By my calculations this will produce 10s of terabytes of Parquet files.

 The first thing I tried to get around this problem
 was passing the data through `coalesce(1000, shuffle=false)` before
 writing.
 This works up to about a month's worth of data;
 after that, coalescing to 1000 partitions produces parquet files larger
 than 5G,
 and writing to S3 fails as a result.
 Also, coalescing slows processing down by at least a factor of 2.
 I do not understand why this should happen, since I use shuffle=false.
 AFAIK coalesce should just be a bookkeeping trick, and the original
 partitions should be processed pretty much the same as before, just with
 their outputs concatenated.

 The only other option I can think of is to write each month coalesced
 as a separate data set with its own summary file
 and union the RDDs when processing the data,
 but I do not know how much overhead that will introduce.

 I am looking for advice on the best way to save data of this size in
 Parquet on S3.
 Apart from solving the summary file issue, I am also looking for ways
 to improve performance.
 Would it make sense to write the data to a local hdfs first and push it to
 S3 with `hadoop distcp`?
 Is putting Tachyon in front of either the input or the output S3 likely to
 help?
 If yes, which is likely to help more?

 I set options on the master as follows

 +
 cat <<EOF >> ~/spark/conf/spark-defaults.conf
 spark.serializer                org.apache.spark.serializer.KryoSerializer
 spark.rdd.compress              true
 spark.shuffle.consolidateFiles  true
 spark.akka.frameSize            20
 EOF

 copy-dir /root/spark/conf
 spark/sbin/stop-all.sh
 sleep 5
 spark/sbin/start-all.sh
 ++

 Does this make sense? Should I set some other options?
 I have also asked these questions on StackOverflow where I reproduced the
 full error messages:

 +
 http://stackoverflow.com/questions/26332542/how-to-save-a-multi-terabyte-schemardd-in-parquet-format-on-s3
 +
 http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark
 +
 http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards

 thanks
 Daniel





-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Haoyuan Li
Hi folks,

We've posted the first Tachyon meetup, which will be on August 25th and is
hosted by Yahoo! (Limited Space):
http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you there!

Best,

Haoyuan

-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread Haoyuan Li
Is speculative execution enabled?
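
For context, a sketch of why this matters: speculative execution launches
duplicate attempts of slow tasks, and two attempts racing to checkpoint and
rename the same Tachyon file can produce exactly this kind of
FailedToCheckpointException. Disabling it is a quick way to test:

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable duplicate (speculative) task attempts while testing the failure.
    val conf = new SparkConf()
      .setAppName("ParquetOnTachyon")
      .set("spark.speculation", "false")
    val sc = new SparkContext(conf)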

Best,

Haoyuan


On Mon, Aug 11, 2014 at 8:08 AM, chutium teng@gmail.com wrote:

 Sharing/reusing RDDs is always useful for many use cases. Is this possible
 by persisting an RDD on tachyon?

 For example, off-heap persisting a named RDD into a given path (instead of
 /tmp_spark_tachyon/spark-xxx-xxx-xxx),
 or
 saveAsParquetFile on tachyon.

 I tried to save a SchemaRDD on tachyon:

 val parquetFile =
   sqlContext.parquetFile("hdfs://test01.zala:8020/user/hive/warehouse/parquet_tables.db/some_table/")
 parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_1")

 but I always get an error; the first error message is:

 14/08/11 16:19:28 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
 in
 memory on test03.zala:37377 (size: 18.7 KB, free: 16.6 GB)
 14/08/11 16:20:06 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 3.0
 (TID 35, test04.zala): java.io.IOException:
 FailedToCheckpointException(message:Failed to rename
 hdfs://test01.zala:8020/tmp/tachyon/workers/140776003/31806/730 to
 hdfs://test01.zala:8020/tmp/tachyon/data/730)
 tachyon.worker.WorkerClient.addCheckpoint(WorkerClient.java:112)
 tachyon.client.TachyonFS.addCheckpoint(TachyonFS.java:168)
 tachyon.client.FileOutStream.close(FileOutStream.java:104)


 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)

 org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
 parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)


 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)

 parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)

 org.apache.spark.sql.parquet.InsertIntoParquetTable.org
 $apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:259)


 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)


 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:272)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)


 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)


 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:722)



 hdfs://test01.zala:8020/tmp/tachyon/
 has already been chmod'ed to 777; both owner and group are the same as the
 spark/tachyon startup user.

 Off-heap persist, or saving a normal text file on tachyon, works fine.

 CDH 5.1.0, Spark 1.1.0 snapshot, Tachyon 0.6 snapshot







-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/