join but I
want to use broadcast join).
Or is there any ticket or roadmap for this feature?
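For reference, Spark 1.5 does expose an explicit broadcast hint in the DataFrame API. A minimal sketch in Java (the DataFrame and column names here are illustrative assumptions, not from this thread):

```java
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.broadcast;

// Sketch: force the small side to be broadcast so the left outer join
// becomes a broadcast hash join, regardless of the size estimate used by
// spark.sql.autoBroadcastJoinThreshold (default ~10MB).
DataFrame joined = bigDf.join(
    broadcast(smallDf),
    bigDf.col("key").equalTo(smallDf.col("key")),
    "left_outer");
```

Without the hint, Spark SQL only auto-broadcasts a side whose estimated size is under spark.sql.autoBroadcastJoinThreshold, so raising that threshold via --conf is the other common approach.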
Regards,
Shuai
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Saturday, December 05, 2015 4:11 PM
To: Shuai Zheng
Cc: Jitesh chandra Mishra; user
Subject: Re: Broadcasting
Hi all,
Sorry to re-open this thread.
I have a similar issue: one big parquet file left outer joined with quite a few
smaller parquet files. But the run is extremely slow and even OOMs sometimes
(with 300M , I have two questions here:
1, If I use an outer join, will Spark SQL automatically use
[mailto:jonathaka...@gmail.com]
Sent: Thursday, November 19, 2015 6:54 PM
To: Shuai Zheng
Cc: user
Subject: Re: Spark Tasks on second node never return in Yarn when I have more
than 1 task node
I don't know if this actually has anything to do with why your job is hanging,
but since you are using EMR you
Hi All,
I face a very weird case. I have already simplified the scenario as much as
possible so everyone can reproduce it.
My env:
AWS EMR 4.1.0, Spark1.5
My code runs without any problem in local mode, and it has no problem when it
runs on an EMR cluster with one
of Spark.
So I want to know: is there any new setting I should configure in Spark 1.5 to make it work?
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, November 04, 2015 3:22 PM
To: user@spark.apache.org
Subject: [Spark 1.5]: Exception in thread "broadcast-hash-j
Hi All,
I have a program which actually runs somewhat complex business logic (joins) in Spark.
And I get the below exception:
I am running on Spark 1.5, with the parameters:
spark-submit --deploy-mode client --executor-cores=24 --driver-memory=2G
--executor-memory=45G --class .
Some other setup:
Hi All,
I have 1000 files, from 500M to 1-2GB at this moment. And I want Spark to read
them as one partition per file, which means I want the FileSplit turned off.
I know there are some solutions, but not very good in my case:
1, I can't use the wholeTextFiles method, because my file is
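One workaround often suggested for this (my assumption, not something stated in the thread) is to raise the input format's minimum split size above the largest input file, so each file becomes exactly one split and therefore one partition:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: with a minimum split size larger than any input file,
// FileInputFormat will never split a file, so Spark gets one partition
// per file. (Alternatively, subclass the input format and override
// isSplitable() to return false.)
Configuration hadoopConf = ctx.hadoopConfiguration();
hadoopConf.set("mapreduce.input.fileinputformat.split.minsize",
    String.valueOf(Long.MAX_VALUE));
```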
, 2015 11:07 AM
To: Shuai Zheng
Cc: user
Subject: Re: How to Take the whole file as a partition
Your situation is special. It seems to me Spark may not fit well in your case.
You want to process the individual files (500M~2G) as a whole, and you want good
performance.
You may want to write our
Hi All,
I try to access a file on S3 via the Hadoop file format APIs:
Below is my code:
Configuration hadoopConf = ctx.hadoopConfiguration();
hadoopConf.set("fs.s3n.awsAccessKeyId",
    this.getAwsAccessKeyId());
to replicate my issue locally (the code doesn't need to run on EC2; I run it
directly from my local Windows PC).
Regards,
Shuai
From: Aaron Davidson [mailto:ilike...@gmail.com]
Sent: Wednesday, June 10, 2015 12:28 PM
To: Shuai Zheng
Subject: Re: [SPARK-6330] 1.4.0/1.5.0 Bug to access
Hi All,
I have some code to access s3 from Spark. The code is as simple as:
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = ctx.hadoopConfiguration();
// aws.secretKey=Zqhjim3GB69hMBvfjh+7NX84p8sMF39BHfXwO3Hs
Hi All,
I want to ask how to use a UDF with the join function on DataFrames. It always
gives me a "cannot resolve column name" error.
Anyone can give me an example on how to run this in java?
My code is like:
edmData.join(yb_lookup,
Hi All,
Basically I try to define a simple UDF and use it in the query, but it gives
me a "Task not serializable" error:
public void test() {
    RiskGroupModelDefinition model =
        registeredRiskGroupMap.get(this.modelId);
    RiskGroupModelDefinition edm =
PM
To: Dean Wampler
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: Slower performance when bigger memory?
FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for more/smaller
executors. Another observation was that one large executor results in less
overall read throughput from
Hi All,
I am running some benchmarks on an r3.8xlarge instance. I have a cluster with
one master (no executor on it) and one slave (r3.8xlarge).
My job has 1000 tasks in stage 0.
An r3.8xlarge has 244G memory and 32 cores.
If I create 4 executors, each with 8 cores + 50G memory, each task
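The arithmetic behind that 4-executor split can be sketched in plain Java (the 10% headroom factor for OS/YARN overhead is my assumption, not a Spark setting):

```java
public class ExecutorSizing {
    // Divide one node's cores evenly among N executors.
    static int coresPerExecutor(int nodeCores, int numExecutors) {
        return nodeCores / numExecutors;
    }

    static int memPerExecutorGB(int nodeMemGB, int numExecutors) {
        // Keep ~10% of node memory as headroom for the OS and YARN overhead
        // (assumed factor), then split the rest evenly.
        return (int) (nodeMemGB * 0.9) / numExecutors;
    }

    public static void main(String[] args) {
        // r3.8xlarge: 32 cores and 244G, split into 4 executors.
        System.out.println(coresPerExecutor(32, 4) + " cores each"); // 8 cores each
        System.out.println(memPerExecutorGB(244, 4) + "G each");     // 54G each
    }
}
```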
Got it. Thanks! :)
From: Yin Huai [mailto:yh...@databricks.com]
Sent: Thursday, April 23, 2015 2:35 PM
To: Shuai Zheng
Cc: user
Subject: Re: Bug? Can't reference to the column by name after join two
DataFrame on a same name key
Hi Shuai,
You can use as to create a table alias
Hi All,
I use 1.3.1.
When I have two DFs and join them on a same-name key, after that I can't get
the common key by name.
Basically:
select * from t1 inner join t2 on t1.col1 = t2.col1
And I am using purely DataFrame, spark SqlContext not HiveContext
DataFrame df3 =
Below is my code to access s3n without problem (only for 1.3.1; there is a bug
in 1.3.0).
Configuration hadoopConf = ctx.hadoopConfiguration();
hadoopConf.set("fs.s3n.impl",
    "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
I have a similar issue (I failed on the spark-core project, but with the same
exception as you). Then I followed the steps below (I am working on Windows):
delete the Maven repository and re-download all dependencies. The issue sounds
like a corrupted jar that can't be opened by Maven.
Other than this,
Hi All,
In some cases, I get the below exception when I run Spark in local mode (I
haven't seen this on a cluster). This is weird, and it also affects my local
unit test cases (it doesn't always happen, but usually once per 4-5 runs). From
the stack, it looks like the error happens when creating the context,
, have no time to dig it out. But if you want me to provide more
information, I will be happy to do that.
Regards,
Shuai
-Original Message-
From: Bozeman, Christopher [mailto:bozem...@amazon.com]
Sent: Wednesday, April 01, 2015 4:59 PM
To: Shuai Zheng; 'Sean Owen'
Cc: 'Akhil Das'; user
, not NULL). I am not sure why this happens (the code runs without any
problem with the default Java serializer). So I think this is a bug,
but I am not sure whether it is a bug in Spark or a bug in Kryo.
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, April 06, 2015 5
Hi All,
I am a bit confused about spark.storage.memoryFraction. This is used to set the
region for RDD usage; does this mean only cached and persisted RDDs?
So if my program has no cached RDDs at all (meaning I have no .cache() or
.persist() calls on any RDD), then I can set this
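For context, under Spark 1.x's legacy memory manager the storage region is executorMemory × spark.storage.memoryFraction (default 0.6) × spark.storage.safetyFraction (default 0.9). A quick sketch of the arithmetic:

```java
public class StorageRegion {
    // Size (in MB) of the region reserved for cached/persisted RDD blocks
    // under the Spark 1.x legacy memory manager.
    static long storageRegionMB(long executorMemMB,
                                double memoryFraction,
                                double safetyFraction) {
        return (long) (executorMemMB * memoryFraction * safetyFraction);
    }

    public static void main(String[] args) {
        // A 10G executor with the 1.x defaults leaves ~5.4G for cached RDDs.
        System.out.println(storageRegionMB(10 * 1024, 0.6, 0.9) + " MB"); // 5529 MB
    }
}
```

If nothing is cached, lowering spark.storage.memoryFraction frees that region for execution/shuffle instead.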
[mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, April 01, 2015 2:40 AM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: --driver-memory parameter doesn't work for spark-submmit on yarn?
Once you submit the job, do a ps aux | grep spark-submit and see how big the
heap space is
,
Shuai
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 01, 2015 10:51 AM
To: Shuai Zheng
Cc: Akhil Das; user@spark.apache.org
Subject: Re: --driver-memory parameter doesn't work for spark-submmit on yarn?
I feel like I recognize that problem, and it's
Hi All,
Below is the my shell script:
/home/hadoop/spark/bin/spark-submit --driver-memory=5G --executor-memory=40G
--master yarn-client --class com.***.FinancialEngineExecutor
/home/hadoop/lib/my.jar s3://bucket/vriscBatchConf.properties
My driver will load some resources and then
Hi All,
I am waiting for Spark 1.3.1 to fix the bug so it works with the S3 file system.
Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because
there is a jar compatibility issue with the AWS SDK.
Regards,
Shuai
Hi,
I am curious: when I start a Spark program in local mode, which parameter
is used to decide the JVM memory size for the executor?
In theory should be:
--executor-memory 20G
But I remember local mode will only start the Spark executor in the same
process as the driver, so it should be:
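As I understand it, in local mode there is no separate executor JVM: the executor runs inside the driver process, so the heap is sized by --driver-memory and --executor-memory is effectively ignored. A hedged sketch (the class and jar names are made up):

```shell
# Local mode: one JVM; size its heap with --driver-memory.
# --executor-memory has no effect here.
spark-submit --master "local[*]" --driver-memory 20G \
  --class com.example.Main app.jar
```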
[mailto:ilike...@gmail.com]
Sent: Tuesday, March 17, 2015 3:06 PM
To: Imran Rashid
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: Spark will process _temporary folder on S3 is very slow and always
cause failure
Actually, this is the more relevant JIRA (which is resolved):
https
Below is the output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967947
max locked memory (kbytes, -l) 64
max memory
Hi All,
I try to run a simple sort on 1.2.1. And it always gives me the below two
errors:
1, 15/03/20 17:48:29 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID
35, ip-10-169-217-47.ec2.internal): java.io.FileNotFoundException:
Hi All,
I am running Spark 1.2.1 with the AWS SDK. To make sure the AWS SDK is
compatible with httpclient 4.2 (which I assume Spark uses?), I have already
downgraded it to version 1.9.0.
But even that, I still got an error:
Exception in thread "main" java.lang.NoSuchMethodError:
Hi Imran,
I am a bit confused here. Assume I have RDD a with 1000 partitions, which has
also been sorted. When creating RDD b (with 20 partitions), how can I make sure
partitions 1-50 of RDD a map to the 1st partition of RDD b? I don't see any
control code/logic here?
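One way to get that consecutive grouping (my suggestion, not from the original reply) is coalesce() with shuffle disabled: it builds each child partition from a group of parent partitions, which tends to be consecutive runs when no locality information is available. A minimal Java sketch with made-up variable names:

```java
import org.apache.spark.api.java.JavaRDD;

// Sketch: reduce 1000 sorted partitions to 20 without a shuffle.
// With shuffle=false each child partition is a group of parent partitions
// (roughly 0-49 -> child 0, 50-99 -> child 1, ..., locality permitting),
// so the global sort order across partitions is preserved.
JavaRDD<String> b = a.coalesce(20, false);
```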
Your code below:
...@gmail.com]
Sent: Monday, March 16, 2015 1:06 PM
To: Shuai Zheng
Cc: user
Subject: Re: [SPARK-3638 ] java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.
From my local maven repo:
$ jar tvf
~/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5
Hi All,
I just upgraded the system to use version 1.3.0, but now
sqlContext.parquetFile doesn't work with s3n. I have tested the same code with
1.2.1 and it works.
A simple test running in spark-shell:
val parquetFile = sqlContext.parquetFile("s3n:///test/2.parq")
.
Regards,
Shuai
From: Kelly, Jonathan [mailto:jonat...@amazon.com]
Sent: Monday, March 16, 2015 2:54 PM
To: Shuai Zheng; user@spark.apache.org
Subject: Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0
See https://issues.apache.org/jira/browse/SPARK-6351
~ Jonathan
Hi All,
I try to run a sort on an r3.2xlarge instance on AWS. I just try to run it
as a single-node cluster for a test. The data I sort is around 4GB and
sits on S3; the output will also be on S3.
I just connect spark-shell to the local cluster and run the code in the
script (because I just
?
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Friday, March 13, 2015 6:51 PM
To: user@spark.apache.org
Subject: Spark will process _temporary folder on S3 is very slow and always
cause failure
Hi All,
I try to run a sorting on a r3.2xlarge instance on AWS
Hi All,
I am running Spark to deal with AWS.
The latest AWS SDK works with httpclient 3.4+, but the
spark-assembly-*.jar file has packaged an old httpclient version, which
causes a ClassNotFoundException for
org/apache/http/client/methods/HttpPatch.
Even when I put the
Hi All,
I try to pass parameters to spark-shell when I do some tests:
spark-shell --driver-memory 512M --executor-memory 4G --master
spark://:7077 --conf spark.sql.parquet.compression.codec=snappy --conf
spark.sql.parquet.binaryAsString=true
This works fine on my local pc. And
Hi All,
I have a lot of parquet files, and I try to open them directly instead of
loading them into an RDD in the driver (so I can optimize some performance
through special logic).
But I have done some research online and can't find any example of accessing
parquet directly from Scala. Has anyone done this
Hi All,
If I have a set of time-series data files, they are in parquet format and
the data for each day are stored following a naming convention, but I will not
know how many files there are for one day:
20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
.
201501010a.parq
.
Hi All,
I am processing some time-series data. One day might have 500GB, so each
hour is around 20GB of data.
I need to sort the data before I start processing. Assume I can sort it
successfully:
dayRDD.sortByKey
but after that, I might have thousands of partitions (to
, 2015 6:00 PM
To: Shuai Zheng
Cc: Shao, Saisai; user@spark.apache.org
Subject: Re: Union and reduceByKey will trigger shuffle even same partition?
I think you're getting tripped up by lazy evaluation and the way stage
boundaries work (admittedly it's pretty confusing in this case).
It is true
enough to put in memory), how can I access another RDD's local partition in
the mapPartitions method? Is there any way to do this in Spark?
From: Shao, Saisai [mailto:saisai.s...@intel.com]
Sent: Monday, February 23, 2015 3:13 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey
Hi All,
I am running a simple PageRank program, but it is slow. And I dug out that part
of the reason is that a shuffle happens when I call a union, even though both
RDDs share the same partitioner:
Below is my test code in spark shell:
import org.apache.spark.HashPartitioner
: Monday, February 23, 2015 3:13 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey will trigger shuffle even same partition?
If you call reduceByKey(), internally Spark will introduce a shuffle
operation, no matter whether the data is already partitioned locally. Spark
on it? How does Spark
control/maintain/detect the liveness of the client SparkContext?
Do I need to set up something special?
Regards,
Shuai
From: Eugen Cepoi [mailto:cepoi.eu...@gmail.com]
Sent: Thursday, February 05, 2015 5:39 PM
To: Shuai Zheng
Cc: Corey Nolet; Charles Feduke; user
Hi All,
I want to develop a server-side application:
User submits a request → the server runs a Spark application and returns (this
might take a few seconds).
So I want to host the server to keep a long-lived context; I don't know
whether this is reasonable or not.
Basically I try to have a
Hi All,
It might sound weird, but I think Spark is perfect to be used as a
multi-threading library in some cases. The local mode will naturally spawn
multiple threads when required. And it is more restricted, so there is less
chance of potential bugs in the code (because it is more data oriented,
), so it creates two different
DAGs for the different methods.
Can anyone confirm this? This is not something I can easily test with code.
Thanks!
Regards,
Shuai
From: Corey Nolet [mailto:cjno...@gmail.com]
Sent: Thursday, February 05, 2015 11:55 AM
To: Charles Feduke
Cc: Shuai Zheng; user
Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, February 05, 2015 10:53 AM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Use Spark as multi-threading library and deprecate web UI
Do you mean disable the web UI? spark.ui.enabled=false
Sure, it's useful with master
have 1 executor per machine per application.
There are cases where you would make more than 1, but these are
unusual.
On Thu, Jan 15, 2015 at 8:16 PM, Shuai Zheng szheng.c...@gmail.com
wrote:
Hi All,
I try to clarify some behavior of executors in Spark, because I am from a
Hadoop background
Hi Tobias,
Can you share more information about how you do that? I also have a similar
question about this.
Thanks a lot,
Regards,
Shuai
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Wednesday, November 26, 2014 12:25 AM
To: Sandy Ryza
Cc: Yanbo Liang; user
Subject:
Hi All,
I am testing Spark on an EMR cluster. The env is a one-node cluster of
r3.8xlarge, which has 32 vCores and 244G memory.
But the command line I use to start spark-shell doesn't work. For
example:
~/spark/bin/spark-shell --jars
/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar
Hi All,
I try to clarify some behavior of executors in Spark. Because I am from a
Hadoop background, I try to compare them to the mapper (or reducer) in
Hadoop.
1, Each node can have multiple executors, each running in its own process? This
is the same as a mapper process.
2, I thought the
behavior - because the executor will run for the whole
lifecycle of the application? Although this may not have any real value in
practice :)
But I still need help for my first question.
Thanks a lot.
Regards,
Shuai
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Thursday, January 15
Thanks a lot!
I just realized that Spark is not really an in-memory version of MapReduce :)
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve
Hi All,
I am trying with a small data set. It is only 200M, and what I am doing
is just a distinct count on it.
But there is a lot of spilling happening in the log (attached at the end of
the email).
Basically I use 10G memory, running on a one-node EMR cluster with an
r3.8xlarge instance
Is it possible that too many connections are open to read from S3 on one node?
I had this issue before because I opened a few hundred files on S3 to read
from one node. It just blocks itself without error until a timeout later.
On Monday, December 22, 2014, durga durgak...@gmail.com wrote:
Hi All,
I
Hi,
I am running code which takes a network file location (not HDFS) as
input. But sc.textFile("networklocation\\README.md") can't recognize
the network location (starting with \\) as a valid location, because it only
accepts HDFS and local file name formats?
Anyone has idea how can I use a
Hi All,
When I try to load a folder into RDDs, is there any way for me to find the
input file name for a particular partition, so I can track which file each
partition came from?
In the hadoop, I can find this information through the code:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String
PM, Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
When I try to load a folder into the RDDs, any way for me to find the
input file name of particular partitions? So I can track partitions from
which file.
In the hadoop, I can find this information through the code:
FileSplit fileSplit
Hi All,
I notice if we create a spark context in driver, we need to call stop method
to clear it.
SparkConf sparkConf = new
SparkConf().setAppName("FinancialEngineExecutor");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
.
String
]
Sent: Wednesday, December 17, 2014 11:04 AM
To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS
Why is it not a good option to create an RDD for each 200MB file and then
apply the pre-calculations before merging them? I think
Nice, that is the answer I want.
Thanks!
From: Sun, Rui [mailto:rui@intel.com]
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng; user@spark.apache.org
Subject: RE: Control default partition when load a RDD from HDFS
Hi, Shuai,
How did you turn off the file split
Hi All,
My application loads 1000 files, each from 200M to a few GB, and combines
them with other data to do calculations.
Some pre-calculation must be done at the individual file level; after that, the
results need to be combined for further calculation.
In Hadoop, it is simple because I can
-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Wednesday, November 05, 2014 6:27 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Any Replicated RDD in Spark?
If you start with an RDD, you do have to collect to the driver and broadcast
to do
(in theory, either way works, but in the real world, which one is
better?).
Regards,
Shuai
-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Monday, November 03, 2014 4:15 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Any Replicated RDD in Spark?
You
!
-Original Message-
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Wednesday, November 05, 2014 3:32 PM
To: 'Matei Zaharia'
Cc: 'user@spark.apache.org'
Subject: RE: Any Replicated RDD in Spark?
Nice.
Then I have another question, if I have a file (or a set of files: part-0,
part-1
Hi All,
I have spent the last two years on Hadoop but am new to Spark.
I am planning to move one of my existing systems to Spark to get some
enhanced features.
My question is:
If I try to do a map-side join (something similar to the "replicated" keyword
in Pig), how can I do it? Is there any way to declare a
Hi,
I am planning to run Spark on EMR. And my application might take a
lot of memory. On EMR, I know there is a hard limit of 16G physical memory on
an individual mapper/reducer (otherwise I will get an exception; this is
confirmed by the AWS EMR team, at least it is the spec at this moment).