authors, join our free online meetup
<https://go.alluxio.io/community-alluxio-day-2021> next Tuesday morning
(April 27) Pacific time.
Best,
- Bin Fan
reach out to me
Best regards
- Bin Fan
Hi everyone!
I am sharing this article about running Spark / Presto workloads on
AWS: Bursting
On-Premise Datalake Analytics and AI Workloads on AWS
<https://bit.ly/3qA1Tom>, published on the AWS blog. Hope you enjoy it. Feel
free to discuss with me here <https://alluxio.io/slack>.
- Bin
n-summit-2020/>.
The summit's speaker lineup spans creators and committers of Alluxio,
Spark, Presto, TensorFlow, and K8s, as well as data engineers and software
engineers building cloud-native data and AI platforms at Amazon, Alibaba,
Comcast, Facebook, Google, ING Bank, Microsoft, Tencent, and more!
- Bin Fan
Hi Spark Users,
Check out this blog on Building High-performance Data Lake using Apache
Hudi, Spark and Alluxio at T3Go <https://bit.ly/373RYPi>
Cheers
- Bin Fan
Have you tried deploying Alluxio as a caching layer on top of S3, providing Spark
an HDFS-like interface?
Like in this article:
https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
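For illustration, here is a minimal sketch of the read path (the master hostname,
port, and file path are hypothetical, and spark is the SparkSession as provided by
spark-shell): once the S3 bucket is mounted into the Alluxio namespace, Spark reads
through the alluxio:// scheme instead of s3a://, so repeated reads are served from
the Alluxio cache.

// Hypothetical Alluxio master address and mount path.
val df = spark.read.parquet("alluxio://alluxio-master:19998/s3-mount/data/events.parquet")
println(df.count()) // later reads of the same data are served from Alluxio storage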
On Wed, May 27, 2020 at 6:52 PM Dark Crusader
wrote:
> Hi Randy,
>
> Ye
as warnings or as errors in the Alluxio master log? It
would be helpful to post the stack trace if it is available.
My hypothesis is that Spark in your case was testing the creation of such a
directory
-Bin
On Wed, Aug 28, 2019 at 1:59 AM Mark Zhao wrote:
> Hey,
>
> When running Spark on Alluxio
] o.a.s.e.Executor - Adding
file:/tmp/spark-0365e48c-1747-4370-978f-7cd142ef0375/userFiles-3309dc5e-b6d0-4b76-a9aa-8e0a226ddab9/xxx.jar
to class loader
Thanks
Chen Bin
need more detailed instructions, feel free to join Alluxio community
channel https://slackin.alluxio.io <https://www.alluxio.io/slack>
- Bin Fan
ROUGH' \
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
  ...
Hope it helps
- Bin
On Tue, Sep 17, 2019 at 7:53 AM Mark Zhao wrote:
> Hi,
>
> If Spark applications write data into Alluxio, can the WriteType be configured?
>
> Thanks,
> Mark
>
>
/en/reference/Properties-List.html#alluxio.user.network.netty.timeout>
in
your Spark jobs.
Check out how to run Spark with customized Alluxio properties
<https://docs.alluxio.io/os/user/stable/en/compute/Spark.html?utm_source=spark&utm_medium=mailinglist>.
- Bin
On Thu, May 9, 2019 at 4:
Spark/YARN
cluster.
Here is the documentation
<https://docs.alluxio.io/os/user/1.8/en/deploy/Running-Alluxio-On-Yarn.html?utm_source=spark>
about
deploying Alluxio with YARN.
- Bin
On Thu, May 9, 2019 at 4:19 AM u9g wrote:
> Hey,
>
> I want to speed up the Spark task runn
separate service (ideally colocated
with Spark servers), of course.
It also enables data sharing across Spark jobs.
- Bin
On Tue, Jan 15, 2019 at 10:29 AM Tomas Bartalos
wrote:
> Hello,
>
> I'm using spark-thrift server and I'm searching for best performing
> solution to
g/docs/1.8/en/basic/Web-Interface.html#master-metrics>
.
If you see a low hit ratio, increase the Alluxio storage size, and vice versa.
Hope this helps,
- Bin
On Thu, Apr 4, 2019 at 9:29 PM Bin Fan wrote:
> Hi Andy,
>
> It really depends on your workloads. I would suggest allocating 20% of
n/advanced/Alluxio-Storage-Management.html#configuring-alluxio-storage
)
- Bin
On Thu, Mar 21, 2019 at 8:26 AM u9g wrote:
> Hey,
>
> We have a cluster of 10 nodes, each of which has 128 GB of memory. We are
> about to run Spark and Alluxio on the cluster. We wonder how shall
> a
unning
df.write.parquet(alluxioFilePath)
and your DataFrames are stored in Alluxio as Parquet files that you can
share with more users.
One advantage of Alluxio here is that you can manually free the cached data
from the memory tier, or
set a TTL on the cached data if you'd like more control over it.
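For illustration, a minimal sketch of that sharing pattern (the master address and
path are placeholders, spark is the SparkSession, and df is the DataFrame from
above): one job writes the DataFrame to Alluxio as Parquet, and any later job,
possibly run by another user, reads the cached copy back.

val alluxioFilePath = "alluxio://alluxio-master:19998/shared/events.parquet"

// Job A materializes the DataFrame once ...
df.write.mode("overwrite").parquet(alluxioFilePath)

// ... and job B (or another user) reads the cached copy back.
val shared = spark.read.parquet(alluxioFilePath)
println(shared.count())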
)
with Alluxio may benefit performance:
http://www.alluxio.com/2016/08/effective-spark-rdds-with-alluxio/
- Bin
On Mon, Sep 19, 2016 at 7:56 AM, aka.fe2s wrote:
> Hi folks,
>
> What has happened with Tachyon / Alluxio in Spark 2? The docs no longer
> mention it.
>
> --
> Oleksiy Dyagilev
>
ectory? or can it read from HBase in
parallel?
I don't see that many examples out there so any help or guidance will be
appreciated.
Also, we are using Cloudera Hadoop so there might be a slight delay with
the latest Spark release.
Best regards,
Bin
Here is one blog illustrating how to use Spark on Alluxio for this purpose.
Hope it will help:
http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/
On Mon, Jul 18, 2016 at 6:36 AM, Gene Pang wrote:
> Hi,
>
> If you want to use Alluxio with Spark 2.x, it is recommended to write
-Alluxio.html
- Bin
On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal wrote:
> Have you looked at Alluxio? (earlier tachyon)
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
> Reifier at Strata Hadoop World
> <https://www.youtube.com/watch?v=eD3LkpPQ
BTW, "lines" is a DStream.
On Fri, Oct 23, 2015 at 2:16 PM, Bin Wang wrote:
> I use mapPartitions to open connections to Redis, I write it like this:
>
> val seqs = lines.mapPartitions { lines =>
> val cache = new RedisCache(redisUrl, redisPort)
> val result = lines.
I use mapPartitions to open connections to Redis, I write it like this:
val seqs = lines.mapPartitions { lines =>
  val cache = new RedisCache(redisUrl, redisPort)
  val result = lines.map(line => Parser.parseBody(line, cache))
  cache.redisPool.close
  result
}
But it see
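For reference, here is a sketch of one common fix for this pattern, reusing the
RedisCache, Parser, redisUrl, and redisPort names from the snippet above (all
defined in the original code, not shown here). Because map on an Iterator is lazy,
the pool is closed before any record is actually parsed; materializing the
partition first avoids hitting a closed connection, at the cost of buffering one
partition in memory.

val seqs = lines.mapPartitions { iter =>
  val cache = new RedisCache(redisUrl, redisPort)
  try {
    // Force evaluation while the Redis pool is still open.
    iter.map(line => Parser.parseBody(line, cache)).toList.iterator
  } finally {
    cache.redisPool.close
  }
}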
the SparkConf
> "spark.streaming.stopGracefullyOnShutdown" to "true"
>
> Note to self, document this in the programming guide.
>
> On Wed, Sep 23, 2015 at 3:33 AM, Bin Wang wrote:
>
>> I'd like the spark application to be stoppe
I'd like the Spark application to be stopped gracefully when it receives a kill
signal, so I added this code:
sys.ShutdownHookThread {
  println("Gracefully stopping Spark Streaming Application")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  println("Application stopped")
}
I'm using Spark Streaming and there may be some delays between batches. I'd
like to know whether it is possible to merge delayed batches into one batch for
processing.
For example, the interval is set to 5 min but the first batch takes 1 hour,
so many batches are delayed. At the end of processing fo
m DB
> 2. By cleaning the checkpoint in between upgrades, data is loaded
> only once
>
> Hope this helps,
> -adrian
>
> From: Bin Wang
> Date: Thursday, September 17, 2015 at 11:27 AM
> To: Akhil Das
> Cc: user
> Subject: Re: How to recovery DStream from
j...@mail.gmail.com%3E
>
> Thanks
> Best Regards
>
> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang wrote:
>
>> And here is another question. If I load the DStream from database every
>> time I start the job, will the data be loaded when the job is failed and
>> auto
And here is another question. If I load the DStream from the database every
time I start the job, will the data be loaded again when the job fails and
auto-restarts? If so, both the checkpoint data and the database data are loaded;
won't this be a problem?
On Wed, Sep 16, 2015 at 8:40 PM, Bin Wang wrote:
&
keeper etc) to keep the state (the indexes etc) and then when you deploy
> new code they can be easily recovered.
>
> Thanks
> Best Regards
>
> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang wrote:
>
>> I'd like to know if there is a way to recovery dstream from checkpoint
I'd like to know if there is a way to recover a DStream from a checkpoint.
Because I store state in the DStream, I'd like the state to be recovered when
I restart the application and deploy new code.
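For context, checkpoint recovery is normally done with StreamingContext.getOrCreate,
sketched below (the checkpoint path is a placeholder). Note that a checkpoint
generally cannot be restored after the application code changes, which is why the
reply above suggests keeping state in an external store such as a database or
ZooKeeper.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-recovery-demo")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // ... define DStream transformations and stateful operations here ...
  ssc
}

// Recovers the context (and DStream state) from the checkpoint if one exists,
// otherwise builds a fresh context with createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()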
I think I've found the reason. It seems that the smallest offset is not
0, so I should not set the offset to 0.
On Mon, Sep 14, 2015 at 2:46 PM, Bin Wang wrote:
> Hi,
>
> I'm using spark streaming with kafka and I need to clear the offset and
> re-compute all things. I deleted checkp
Hi,
I'm using Spark Streaming with Kafka and I need to clear the offsets and
re-compute everything. I deleted the checkpoint directory in HDFS and reset the
Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can
confirm the offsets are set to 0 in Kafka:
~ > kafka-run-class kafka.tools.Consu
tered receiver
for stream 0: Stopped by driver
On Sun, Sep 13, 2015 at 4:05 PM, Tathagata Das wrote:
> Maybe the driver got restarted. See the log4j logs of the driver before it
> restarted.
>
> On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote:
>
>> I'm using spark streaming 1.4.0 and h
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data
it has received. But today the historical data in the DStream seems to have been
lost suddenly, and the application UI also lost the streaming processing time and
all the related data. Could anyone give some hints on debugging this? Thanks.
l 16, 2015 1:33 PM, "Bin Wang" wrote:
>
>> If I write code like this:
>>
>> val rdd = input.map(_.value)
>> val f1 = rdd.filter(_ == 1)
>> val f2 = rdd.filter(_ == 2)
>> ...
>>
>> Then the DAG of the execution may be this:
>>
>>
If I write code like this:
val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
...
Then the DAG of the execution may be this:
      -> Filter -> ...
Map
      -> Filter -> ...
But the two filters operate on the same RDD, which means they could be
done by
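For illustration, here is a sketch of two common ways to avoid recomputing the
shared map step, assuming input is an RDD whose elements have a value field, as in
the snippet above.

// Option 1: persist the shared RDD so both filter branches reuse the mapped data
// instead of re-running the map for each action.
val rdd = input.map(_.value).cache()
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)

// Option 2: split in a single pass by tagging each record with the branch it
// belongs to, then routing downstream as needed.
val tagged = input.map(_.value).flatMap {
  case v @ 1 => Some(("f1", v))
  case v @ 2 => Some(("f2", v))
  case _     => None
}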
Thanks for the help. I set --executor-cores and it works now. I had been using
--total-executor-cores and didn't realize it had changed.
On Fri, Jul 10, 2015 at 3:11 AM, Tathagata Das wrote:
> 1. There will be a long-running job with description "start()" as that is
> the job that is running the receivers. It will never e
I'm using Spark Streaming with Kafka and submit it to a YARN cluster in
"yarn-cluster" mode. But it hangs at SparkContext.start(). The Kafka config
is right, since it can show some events in the "Streaming" tab of the web UI.
The attached file is a screenshot of the "Jobs" tab of the web UI. The code
in th
for
> the lifetime of the streaming app.
>
> On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wrote:
>
>> I'm writing a streaming application and want to use spark-submit to
>> submit it to a YARN cluster. I'd like to submit it in a client node and
>> exit spar
I'm writing a streaming application and want to use spark-submit to submit
it to a YARN cluster. I'd like to submit it from a client node and have
spark-submit exit after the application is running. Is it possible?
am having a hard time adding
it to the path.
This is the final spark-submit command I have, but I still get the class-not-found
error. Can anyone help me with this?
#!/bin/bash
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
/bin/bash $SPARK_HOME/bin/spark-submit \
--master yarn-client
Hi Felix and Tomoas,
Thanks a lot for your information. I figured out the environment variable
PYSPARK_PYTHON is the secret key.
My current approach is to start iPython notebook on the namenode,
export PYSPARK_PYTHON=/opt/local/anaconda/bin/ipython
/opt/local/anaconda/bin/ipython notebook
"#!/opt/local/anaconda" at the top of my Python code
and use spark-submit to distribute it to the cluster. However, since I am
using iPython notebook, this is not available as an option.
Best,
Bin
application running on top of YARN
interactively in the iPython notebook:
Here is the code that I have written:
import sys
import os
from pyspark import SparkContext, SparkConf
sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python')
sys.path.append('/home/hadoop/myuser/
stalled on
every node. Should I install Anaconda Python on all of them? If so, what is
the modern way of managing the Python ecosystem on the cluster?
I am a big fan of Python so please guide me.
Best regards,
Bin
Python notebook
environment.
Best regards,
Bin
it related to the partition strategy? For
now, I used the default partition strategy.
Looking for advice!
Thanks very much!
Best,
Bin
Hi All,
I am running a customized label propagation using Pregel. After a few
iterations, the program becomes slow and wastes a lot of time in mapPartitions
(at GraphImpl.scala:184 or VertexRDD.scala:318, or VertexRDD.scala:323). And
the amount of shuffle write reaches 15GB, while the size of
ere any other better solutions?
Thanks a lot!
Best,
Bin
On 2014-08-06 04:54:39, "Bin" wrote:
Hi All,
Finally, I found that the problem occurred when I called the GraphX lib:
"
Exception in thread "main" java.lang.IllegalArgumentException: Can't zip R
Graph.triplets.foreach(tri=>println())
"
Any advice?
Thanks a lot!
Best,
Bin
ut I couldn't think of a better way.
I am confused about how the partitions ended up unequal, and how I can control the
number of partitions of these RDDs. Can someone give me some advice on this
problem?
Thanks very much!
Best,
Bin
.
Looking for advice!
Thanks a lot!
Best,
Bin
Thanks for the advice. But since I am not the administrator of our Spark
cluster, I can't do this. Is there any better solution based on the current
Spark?
At 2014-08-01 02:38:15, "shijiaxin" wrote:
>Have you tried to write another similar function like edgeListFile in the
>same file, and then
It seems that I cannot specify the weights. I have also tried to imitate
GraphLoader.edgeListFile, but I can't call the methods and classes used in
GraphLoader.edgeListFile.
Have you successfully done this?
At 2014-08-01 12:47:08, "shijiaxin" wrote:
>I think you can try GraphLoader.edgeListFil
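For anyone hitting the same limitation, here is a sketch of loading a weighted edge
list without GraphLoader.edgeListFile (which ignores weights). The input path and
file format are assumptions: one "srcId dstId weight" triple per line, and sc is
the SparkContext.

import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.textFile("hdfs:///data/edges.txt").map { line =>
  val Array(src, dst, w) = line.trim.split("\\s+")
  Edge(src.toLong, dst.toLong, w.toDouble)
}
// Vertices referenced by the edges get a default attribute (here 1.0).
val graph: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 1.0)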
Hi Haiyang,
Thanks, it really is the reason.
Best,
Bin
On 2014-07-31 08:05:34, "Haiyang Fu" wrote:
Have you tried to increase the driver memory?
On Thu, Jul 31, 2014 at 3:54 PM, Bin wrote:
Hi All,
The data size of my task is about 30mb. It runs smoothly in local mode.
Howev
edgeRDD, respectively. Then create the graph using Graph(vertices, edges).
I wonder whether there is a better way to do this?
Looking for advice!
Thanks very much!
Best,
Bin
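For completeness, here is a sketch of the constructor-based approach described
above, with tiny hypothetical vertex and edge data (sc is the SparkContext).

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[Double]] =
  sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5)))

// The third argument supplies an attribute for vertices that appear only in edges.
val graph: Graph[String, Double] = Graph(vertices, edges, defaultVertexAttr = "unknown")
println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")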
-program-hangs-at-job-finished-toarray-workers-throw-java-util-concurren.
I also toArray my data, which was the reason of his case.
However, how come it runs OK in local but not in the cluster? The memory of
each worker is over 60g, and my run command is:
"$SPARK_HOME/bin/spark-
Hi All,
I wonder how to access a vertex via its vertexId? I need to get vertex's
attributes after running graph algorithm.
Thanks very much!
Best,
Bin
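For illustration, here is a sketch of fetching one vertex's attribute after an
algorithm has run, where graph is the result graph and targetId is a hypothetical
vertex id. graph.vertices is an RDD of (VertexId, attribute) pairs, so a filter
plus collect performs the lookup.

val targetId = 42L
val attr = graph.vertices
  .filter { case (id, _) => id == targetId }
  .map { case (_, value) => value }
  .collect()
  .headOption // None if the vertex does not exist in the graph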
At least, Spark Streaming doesn't support Python at this moment, right?
On Mon, Apr 14, 2014 at 6:48 PM, Andrew Ash wrote:
> Hi Spark users,
>
> I've always done all my Spark work in Scala, but occasionally people ask
> about Python and its performance impact vs the same algorithm
> implementat
uster with HBase preconfigured
and give it a try.
Sorry, I cannot provide a more detailed explanation or help.
On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier wrote:
> Thanks for the quick reply Bin. Phenix is something I'm going to try for
> sure but is seems somehow useless if I can
group and the "stats"
functions spark has already implemented are still on the roadmap. I am not
sure whether it will be good but might be something interesting to check
out.
/usr/bin
On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier wrote:
> Hi to everybody,
>
> in the
?
assembly-plugin?..etc)
2. mvn install or mvn clean install or mvn install compile assembly:single?
3. after you have a jar file, then how do you execute the jar file instead
of using bin/run-example...
To answer those people who might ask what you have done
(Here is a derivative from the
uster...
Thanks,
Bin
On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi wrote:
> I have on cloudera vm
> http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
> which version are you trying to setup on cloudera.. also which cloudera
> version are you using...
>
that you have done since you've already
made it!
Bin
On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski wrote:
> I should add that in this setup you really do not need to look for the
> printout of the master node's IP - you set it yourself a priori. If anyone
> is interested
Hi there,
I have a CDH cluster set up, and I tried using the Spark parcel that comes with
Cloudera Manager, but it turned out it doesn't even have the run-example
script in the bin folder. Then I removed it from the cluster and
cloned incubator-spark onto the name node of my cluster