authors, join our free online meetup
<https://go.alluxio.io/community-alluxio-day-2021> next Tuesday morning
(April 27) Pacific time.
Best,
- Bin Fan
reach out to me
Best regards
- Bin Fan
Hi everyone!
I am sharing this article about running Spark / Presto workloads on
AWS: Bursting
On-Premise Datalake Analytics and AI Workloads on AWS
<https://bit.ly/3qA1Tom> published on the AWS blog. Hope you enjoy it. Feel
free to discuss with me here <https://alluxio.io/slack>.
- Bin
n-summit-2020/>.
The summit's speaker lineup spans creators and committers of Alluxio,
Spark, Presto, TensorFlow, and K8s, as well as data engineers and software
engineers building cloud-native data and AI platforms at Amazon, Alibaba,
Comcast, Facebook, Google, ING Bank, Microsoft, Tencent, and more!
- Bin Fan
Hi Spark Users,
Check out this blog on Building High-performance Data Lake using Apache
Hudi, Spark, and Alluxio at T3Go <https://bit.ly/373RYPi>
Cheers
- Bin Fan
Have you tried deploying Alluxio as a caching layer on top of S3, providing
Spark with an HDFS-like interface?
Like in this article:
https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
On Wed, May 27, 2020 at 6:52 PM Dark Crusader
wrote:
> Hi Randy,
>
>
as warnings or as errors in the Alluxio master log? It
would be helpful to post the stack trace if it is available.
My hypothesis is that Spark in your case was testing whether it could create
such a directory.
-Bin
On Wed, Aug 28, 2019 at 1:59 AM Mark Zhao wrote:
> Hey,
>
> When running Spark on Alluxio
] o.a.s.e.Executor - Adding
file:/tmp/spark-0365e48c-1747-4370-978f-7cd142ef0375/userFiles-3309dc5e-b6d0-4b76-a9aa-8e0a226ddab9/xxx.jar
to class loader
Thanks
Chen Bin
need more detailed instructions, feel free to join the Alluxio community
Slack channel https://slackin.alluxio.io <https://www.alluxio.io/slack>
- Bin Fan
'
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
  ...
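Pieced together, the full invocation might look like the sketch below. The class name, application jar, and the matching driver-side option are assumptions, not from the original message; only the `alluxio.user.file.writetype.default=CACHE_THROUGH` property comes from this thread.

```shell
# Sketch: set the Alluxio write type for both the driver and the executors.
# MyApp and my-app.jar are placeholders.
spark-submit \
  --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
  --class MyApp \
  my-app.jar
```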
Hope it helps
- Bin
On Tue, Sep 17, 2019 at 7:53 AM Mark Zhao wrote:
> Hi,
>
> If Spark applications write data into Alluxio, can the WriteType be configured?
>
> Thanks,
> Mark
>
>
/en/reference/Properties-List.html#alluxio.user.network.netty.timeout>
in
your Spark jobs.
Check out how to run Spark with customized Alluxio properties
<https://docs.alluxio.io/os/user/stable/en/compute/Spark.html?utm_source=spark_medium=mailinglist>
.
- Bin
On Thu, May 9, 2019 at 4:39 A
Spark/YARN
cluster.
Here is the documentation
<https://docs.alluxio.io/os/user/1.8/en/deploy/Running-Alluxio-On-Yarn.html?utm_source=spark>
about
deploying Alluxio with YARN.
- Bin
On Thu, May 9, 2019 at 4:19 AM u9g wrote:
> Hey,
>
> I want to speed up the Spark task runn
separate service (ideally colocated
with Spark servers), of course.
But it also enables data sharing across Spark jobs.
- Bin
On Tue, Jan 15, 2019 at 10:29 AM Tomas Bartalos
wrote:
> Hello,
>
> I'm using the spark-thrift server and I'm searching for the best-performing
> solution to query hot
g/docs/1.8/en/basic/Web-Interface.html#master-metrics>
.
If you see a low hit ratio, increase the Alluxio storage size, and vice versa.
Hope this helps,
- Bin
On Thu, Apr 4, 2019 at 9:29 PM Bin Fan wrote:
> Hi Andy,
>
> It really depends on your workloads. I would suggest allocating 20% of
n/advanced/Alluxio-Storage-Management.html#configuring-alluxio-storage
)
- Bin
On Thu, Mar 21, 2019 at 8:26 AM u9g wrote:
> Hey,
>
> We have a cluster of 10 nodes, each of which has 128 GB of memory. We are
> about to run Spark and Alluxio on the cluster. We wonder how shall
unning
df.write.parquet(alluxioFilePath)
and your DataFrames are stored in Alluxio as Parquet files, which you can
share with more users.
One advantage with Alluxio here is that you can manually free the cached data
from the memory tier, or set a TTL on the cached data, if you'd like more
control over the data.
)
with Alluxio may benefit performance:
http://www.alluxio.com/2016/08/effective-spark-rdds-with-alluxio/
- Bin
On Mon, Sep 19, 2016 at 7:56 AM, aka.fe2s <aka.f...@gmail.com> wrote:
> Hi folks,
>
> What has happened with Tachyon / Alluxio in Spark 2? Doc doesn't mention
directory? Or can it read from HBase in
parallel?
I don't see many examples out there, so any help or guidance would be
appreciated.
Also, we are using Cloudera Hadoop, so there might be a slight delay in
getting the latest Spark release.
Best regards,
Bin
Here is one blog illustrating how to use Spark on Alluxio for this purpose.
Hope it will help:
http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/
On Mon, Jul 18, 2016 at 6:36 AM, Gene Pang wrote:
> Hi,
>
> If you want to use Alluxio with Spark 2.x, it
-on-Alluxio.html
- Bin
On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
> Have you looked at Alluxio? (earlier tachyon)
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
> Reifier at Strata Hadoop World
> <https:/
I use mapPartitions to open connections to Redis. I write it like this:
val seqs = lines.mapPartitions { lines =>
  val cache = new RedisCache(redisUrl, redisPort)
  val result = lines.map(line => Parser.parseBody(line, cache))
  cache.redisPool.close
  result
}
But it
BTW, "lines" is a DStream.
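The message is cut off, but a common failure mode with this exact shape is that `Iterator.map` is lazy: `cache.redisPool.close` runs before any line is parsed, so the connection is already closed when Spark finally consumes the iterator. A plain-Scala sketch of the pitfall and one fix; `FakeCache` and `parse` are made-up stand-ins for `RedisCache` and `Parser.parseBody`, with no Redis involved:

```scala
object LazyCloseDemo {
  // Stand-in for the RedisCache in the question: just tracks whether it is closed.
  final class FakeCache { var closed = false; def close(): Unit = closed = true }

  def parse(line: String, cache: FakeCache): String = {
    require(!cache.closed, "cache used after close") // what a real client would fail on
    line.toUpperCase
  }

  def main(args: Array[String]): Unit = {
    // Buggy shape from the question: map is lazy, so close() runs first.
    val buggy = {
      val cache = new FakeCache
      val result = Iterator("a", "b").map(parse(_, cache))
      cache.close()
      result
    }
    val failed =
      try { buggy.toList; false }
      catch { case _: IllegalArgumentException => true }
    println(s"lazy version failed: $failed") // true

    // Fix: materialize the partition before closing the resource.
    val ok = {
      val cache = new FakeCache
      val result = Iterator("a", "b").map(parse(_, cache)).toList
      cache.close()
      result
    }
    println(s"eager version: $ok") // List(A, B)
  }
}
```

In the real job the fix is analogous: inside `mapPartitions`, materialize the mapped lines (e.g. with `toList`) before closing the pool, then return `list.iterator`.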
Bin Wang <wbi...@gmail.com> wrote on Friday, October 23, 2015 at 2:16 PM:
> I use mapPartitions to open connections to Redis, I write it like this:
>
> val seqs = lines.mapPartitions { lines =>
> val cache = new RedisCache(redisUrl, redisPort)
>
I'd like the Spark application to stop gracefully when it receives a kill
signal, so I added this code:
sys.ShutdownHookThread {
  println("Gracefully stopping Spark Streaming Application")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  println("Application
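Separately from Spark, the hook mechanism this relies on can be sketched in plain Scala: `sys.ShutdownHookThread` registers a JVM shutdown hook whose body runs when the process exits, including on a kill signal (SIGTERM). `HookDemo` is an illustrative name, not from the thread.

```scala
object HookDemo {
  def main(args: Array[String]): Unit = {
    // Registers a JVM shutdown hook and returns a handle to it.
    val hook = sys.ShutdownHookThread {
      println("Gracefully stopping") // runs when the JVM exits
    }
    println("main finished") // prints first; the hook fires at JVM exit
    // hook.remove() would deregister the hook if it were no longer wanted.
  }
}
```

In the streaming app above, the hook body calls `ssc.stop(stopSparkContext = true, stopGracefully = true)` so in-flight batches can finish before the JVM exits.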
I'm using Spark Streaming and there may be some delays between batches. I'd
like to know whether it is possible to merge delayed batches into one batch
for processing.
For example, the interval is set to 5 minutes but the first batch takes 1
hour, so many batches are delayed. At the end of processing
07.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E
>
> Thanks
> Best Regards
>
> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote:
>
>> And here is another question. If I load the DStream from database every
>> time I start the job, will the dat
the values preloaded from DB
> 2. By cleaning the checkpoint in between upgrades, data is loaded
> only once
>
> Hope this helps,
> -adrian
>
> From: Bin Wang
> Date: Thursday, September 17, 2015 at 11:27 AM
> To: Akhil Das
> Cc: user
> Subject: Re: How t
torage (like a db or
> zookeeper etc) to keep the state (the indexes etc) and then when you deploy
> new code they can be easily recovered.
>
> Thanks
> Best Regards
>
> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote:
>
>> I'd like to know if th
I'd like to know if there is a way to recover a DStream from a checkpoint.
Because I store state in the DStream, I'd like the state to be recovered when
I restart the application and deploy new code.
And here is another question: if I load the DStream from the database every
time I start the job, will the data be loaded when the job fails and
auto-restarts? If so, both the checkpoint data and the database data are
loaded; won't this be a problem?
Bin Wang <wbi...@gmail.com> wrote on Wednesday, September 16, 2015:
Hi,
I'm using Spark Streaming with Kafka, and I need to clear the offsets and
re-compute everything. I deleted the checkpoint directory in HDFS and reset
the Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can
confirm the offsets are set to 0 in Kafka:
~ > kafka-run-class
receiver
for stream 0: Stopped by driver
Tathagata Das <t...@databricks.com> wrote on Sunday, September 13, 2015 at 4:05 PM:
> Maybe the driver got restarted. See the log4j logs of the driver before it
> restarted.
>
> On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang <wbi...@gmail.com> wrote:
>
>
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data
it has received. But today the historical data in the DStream seems to have
been lost suddenly. The application UI also lost the streaming processing
time and all the related data. Could anyone give me some hints for debugging
this? Thanks.
16, 2015 1:33 PM, Bin Wang wbi...@gmail.com wrote:
If I write code like this:
val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
...
Then the DAG of the execution may be this:

        +- Filter - ...
  Map --+
        +- Filter - ...

But the two filters are operated on the same RDD, which means it could be
done by
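The recomputation question above can be illustrated without Spark: a lazy Scala view plays the role of an un-cached RDD lineage, and materializing it once plays the role of `rdd.cache()`. The object and counts below are illustrative only, not Spark's actual scheduling:

```scala
object SharedStageDemo {
  def main(args: Array[String]): Unit = {
    var mapCalls = 0
    val input = (1 to 4).view                      // lazy, like an un-cached lineage
    val rdd = input.map { x => mapCalls += 1; x }  // analogue of input.map(_.value)
    val f1 = rdd.filter(_ == 1).toList             // traversal #1 replays the map
    val f2 = rdd.filter(_ == 2).toList             // traversal #2 replays it again
    println(s"without cache: $mapCalls map calls") // 8 = 4 elements x 2 traversals

    mapCalls = 0
    val cached = input.map { x => mapCalls += 1; x }.toList // analogue of rdd.cache()
    val g1 = cached.filter(_ == 1)
    val g2 = cached.filter(_ == 2)
    println(s"with cache: $mapCalls map calls")    // 4 = the map ran once
  }
}
```

In Spark the analogue is `val rdd = input.map(_.value).cache()`: without it, each filter's action re-runs the shared map stage over the same input.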
I'm using Spark Streaming with Kafka, and I submit it to a YARN cluster in
yarn-cluster mode. But it hangs at SparkContext.start(). The Kafka config
is right, since some events show up in the Streaming tab of the web UI.
The attached file is a screenshot of the Jobs tab of the web UI. The code
in the
Thanks for the help. I set --executor-cores and it works now. I had used
--total-executor-cores and didn't realize it had changed.
Tathagata Das <t...@databricks.com> wrote on Friday, July 10, 2015 at 3:11 AM:
1. There will be a long-running job with the description start(), as that is
the job that runs the receivers.
of the streaming app.
On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wbi...@gmail.com wrote:
I'm writing a streaming application and want to use spark-submit to
submit it to a YARN cluster. I'd like to submit it in a client node and
exit spark-submit after the application is running. Is it possible
I'm writing a streaming application and want to use spark-submit to submit
it to a YARN cluster. I'd like to submit it from a client node and have
spark-submit exit after the application is running. Is that possible?
am having a hard time adding
it to the path.
This is the final spark-submit command I have, but I still get the
class-not-found error. Can anyone help me with this?
#!/bin/bash
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
/bin/bash $SPARK_HOME/bin/spark-submit \
--master yarn-client
Hi Felix and Tomoas,
Thanks a lot for your information. I figured out that the environment
variable PYSPARK_PYTHON is the key.
My current approach is to start the iPython notebook on the namenode:
export PYSPARK_PYTHON=/opt/local/anaconda/bin/ipython
/opt/local/anaconda/bin/ipython notebook
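Later Spark releases split the driver interpreter from the worker interpreter, which makes the notebook setup more explicit. A sketch is below: the Anaconda path is the one from this thread, the variable names are standard PySpark environment variables, and `yarn-client` is an assumption about the deployment mode.

```shell
# Workers and the driver should run compatible interpreters.
export PYSPARK_PYTHON=/opt/local/anaconda/bin/python          # used by the executors
export PYSPARK_DRIVER_PYTHON=/opt/local/anaconda/bin/ipython  # driver side only
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"                  # open the notebook UI
$SPARK_HOME/bin/pyspark --master yarn-client
```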
at the top of my Python code
and use spark-submit to distribute it to the cluster. However, since I am
using iPython notebook, this is not available as an option.
Best,
Bin
application running on top of YARN
interactively in the iPython notebook:
Here is the code that I have written:
import sys
import os
from pyspark import SparkContext, SparkConf
sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python')
sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin
notebook
environment.
Best regards,
Bin
Hi All,
I am running a customized label propagation using Pregel. After a few
iterations, the program becomes slow and wastes a lot of time in mapPartitions
(at GraphImpl.scala:184 or VertexRDD.scala:318, or VertexRDD.scala:323). And
the amount of shuffle write reaches 15GB, while the size of
solutions?
Thanks a lot!
Best,
Bin
On 2014-08-06 04:54:39, Bin <wubin_phi...@126.com> wrote:
Hi All,
Finally I found that the problem occurred when I called the GraphX lib:
Exception in thread "main" java.lang.IllegalArgumentException: Can't zip RDDs
with unequal numbers of partitions
how come the partitions were unequal, and how I can control the
number of partitions of these RDDs. Can someone give me some advice on this
problem?
Thanks very much!
Best,
Bin
Thanks for the advice. But since I am not the administrator of our Spark
cluster, I can't do this. Is there any better solution based on the current
Spark?
At 2014-08-01 02:38:15, shijiaxin shijiaxin...@gmail.com wrote:
Have you tried to write another similar function like edgeListFile in the
-program-hangs-at-job-finished-toarray-workers-throw-java-util-concurren.
I also toArray my data, which was the cause in his case.
However, how come it runs OK locally but not on the cluster? The memory of
each worker is over 60 GB, and my run command is:
$SPARK_HOME/bin/spark-class
and the stats
functions Spark has already implemented are still on the roadmap. I am not
sure whether it will be good, but it might be something interesting to check
out.
On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
Hi to everybody,
in these days I looked a bit
...
Thanks,
Bin
On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
I have on cloudera vm
http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
which version are you trying to set up on Cloudera? Also, which Cloudera
version are you using?
assembly-plugin?..etc)
2. mvn install or mvn clean install or mvn install compile assembly:single?
3. after you have a jar file, then how do you execute the jar file instead
of using bin/run-example...
To answer those people who might ask what you have done
(Here is a derivative from