--
Lan Jiang
https://hpi.de/naumann/people/lan-jiang
Hasso-Plattner-Institut an der Universität Potsdam
Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam
Tel +49 331 5509 280
data, or a
MulticlassClassificationEvaluator for multiclass problems*. "
https://spark.apache.org/docs/2.1.0/ml-tuning.html
Can someone shed some light on this issue?
Lan
241, which
is fixed in 2.0.
Lan
them proactively? For example, if the
processing time or scheduling delay exceeds a certain threshold, send an alert to the
admin/developer?
Lan
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
y question
is why it does not count the permgen size and the memory used by thread stacks. They are
not part of the max heap size. IMHO, the YARN executor container memory should be set
to: spark.executor.memory + [-XX:MaxPermSize] + number_of_threads * [-Xss] +
spark.yarn.executor.memoryOverhead. What did I
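As a sanity check, the formula above can be tabulated with a quick back-of-the-envelope script. The numbers below are purely illustrative assumptions, not recommendations:

```python
# Rough sketch of the container-sizing formula proposed above.
# All inputs are illustrative assumptions.

def container_memory_mb(executor_memory_mb, max_permgen_mb,
                        num_threads, thread_stack_mb, memory_overhead_mb):
    # spark.executor.memory + [-XX:MaxPermSize]
    #   + number_of_threads * [-Xss] + spark.yarn.executor.memoryOverhead
    return (executor_memory_mb + max_permgen_mb
            + num_threads * thread_stack_mb + memory_overhead_mb)

# e.g. a 4 GB heap, 256 MB permgen, 100 threads with 1 MB stacks, 384 MB overhead:
print(container_memory_mb(4096, 256, 100, 1, 384))  # 4836
```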
but I am worried if it is single large file.
>>
>> In this case, this would only work in a single executor, which I think will
>> end up with an OutOfMemoryException.
>>
>> Spark JSON data source does not support multi-line JSON as input due to
>> the limitation of TextInpu
Hi, there
Spark has provided a JSON document processing feature for a long time. In
most examples I see, each line in the sample file is a JSON object. That is
the easiest case. But how can we process a JSON document that does not
conform to this standard format (one line per JSON object)? Here
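To make the failure mode concrete, here is a small pure-Python illustration (no Spark involved) of why line-at-a-time parsing breaks on a pretty-printed document, while parsing the whole file as one string works. This is the idea behind workarounds such as reading each file whole (e.g. with sc.wholeTextFiles) and then parsing:

```python
import json

# A pretty-printed JSON document: one object spanning several lines.
doc = """{
  "name": "spark",
  "version": "2.1.0"
}"""

# Line-at-a-time parsing (what the one-object-per-line format assumes) fails:
errors = 0
for line in doc.splitlines():
    try:
        json.loads(line)
    except json.JSONDecodeError:
        errors += 1
print(errors)  # 4 -- every line fails on its own

# Parsing the whole document at once works:
obj = json.loads(doc)
print(obj["name"])  # spark
```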
online and
offline learning.
Lan
> On Mar 6, 2016, at 2:43 AM, Chris Miller <cmiller11...@gmail.com> wrote:
>
> Guru: This is a really great response. Thanks for taking the time to explain
> all of this. Helpful for me too.
>
>
> --
> Chris Miller
>
>
Sorry, accidentally sent again. My apologies.
> On Mar 6, 2016, at 1:22 PM, Lan Jiang <ljia...@gmail.com> wrote:
>
> Hi, there
>
> I hope someone can clarify this for me. It seems that some of the MLlib
> algorithms such as KMeans, Linear Regression and Logistic Regres
should be
able to run in the streaming application. Am I wrong?
Thanks in advance.
Lan
Hi, there
I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold.
According to the programming guide
*Note that currently statistics are only supported for Hive Metastore
tables where the command ANALYZE TABLE COMPUTE STATISTICS
noscan has been run.*
My question is that is
Michael,
Thanks for the reply.
On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust
wrote:
> My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE
>> compute statistics" command in Hive shell, is the statistics
>> going to be used by SparkSQL to
Hello,
Is it possible for existing R machine learning packages (which work with R
data frames), such as bnlearn, to work with SparkR data frames? Or do I need
to convert SparkR data frames to R data frames? Is "collect" the function
to do the conversion, or how else can I do that?
Many Thanks,
Lan
ed.
I do not find the answer in the documentation on whether metadata checkpointing
is done for each batch and whether the checkpointInterval setting applies to both
types of checkpointing. Maybe I missed it. If anyone can point me to the right
documentation, I would highly appreciate it.
Best Regards,
Lan
does not have a REPL shell, which is a major drawback from my perspective.
Lan
> On Dec 16, 2015, at 3:46 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> There are solid reasons to have built spark on the jvm vs python. The
> question for Daniel appear to be at th
I have not run into any linkage problem, but maybe I was lucky. :-). The
reason I wanted to use protobuf 3 is mainly for Map type support.
On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran <ste...@hortonworks.com>
wrote:
>
> > On 5 Nov 2015, at 00:12, Lan Jiang <ljia...@gmail.com
protobuf 3 jar file either through --jars during spark-submit or
package it into an uber jar file with your own classes.
Lan
> On Nov 4, 2015, at 4:07 PM, Cassa L <lcas...@gmail.com> wrote:
>
> Hi,
> Does spark support protobuff 3.0? I used protobuff 2.5 with spark-1.4 bu
is
splittable and will not create so many partitions.
Lan
> On Oct 20, 2015, at 8:03 AM, François Pelletier
> <newslett...@francoispelletier.org> wrote:
>
> You should aggregate your files in larger chunks before doing anything else.
> HDFS is not fit for small files. It wi
I think the data file is binary per the original post. So in this case,
sc.binaryFiles should be used. However, I still recommend against using so many
small binary files, as:
1. They are not good for batch I/O.
2. They put too much memory pressure on the NameNode.
Lan
> On Oct 20, 2015, at 11
this is the problem.
After I increased spark.yarn.executor.memoryOverhead, it worked
fine. I was using Spark 1.3, which has the default value executorMemory *
0.07, with a minimum of 384. In Spark 1.4 and later, the default value was
changed to executorMemory * 0.10, with a minimum of 384.
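The two defaults mentioned above amount to a max(fraction * executorMemory, 384 MB) rule. A tiny sketch, with illustrative executor sizes:

```python
# Sketch of the spark.yarn.executor.memoryOverhead default described above.
def default_overhead_mb(executor_memory_mb, fraction, minimum_mb=384):
    return max(int(executor_memory_mb * fraction), minimum_mb)

print(default_overhead_mb(8192, 0.07))  # Spark 1.3 default (7%):  573
print(default_overhead_mb(8192, 0.10))  # Spark 1.4+ default (10%): 819
print(default_overhead_mb(2048, 0.07))  # small executors hit the 384 MB floor
```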
Lan
Hmm, that’s odd.
You can always use repartition(n) to increase the partition number, but then
there will be a shuffle. How large is your ORC file? Have you used the NameNode UI
to check how many HDFS blocks each ORC file has?
Lan
> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gma
still a bug, or is there something I need to do in the Spark application to
report the correct job status to YARN?
Lan
The partition number should be the same as the number of HDFS blocks, not the
number of files. Did you confirm from the Spark UI that only 12 partitions were
created? What is your ORC orc.stripe.size?
Lan
> On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
>
to be 6g. Thus I
expect the memory cache to be 6 * 0.9 * 0.6 = 3.24g. However, the Spark
history server shows the reserved cache size for each executor as
3.1g. So it does not add up. What do I miss?
Lan
, write the result to HDFS. I use spark 1.3 with
spark-avro (1.0.0). The error only happens when running on the whole
dataset. When running on 1/3 of the files, the same job completes without
error.
On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang <ljia...@gmail.com> wrote:
> Hi, there
executors to find out why they were lost?
Thanks
Lan
g URL" in the application overview section. When I click it, it
brings me to the Spark history server UI, where I cannot find the lost
executors. The only logs link I can find on the YARN RM site is the
ApplicationMaster log, which is not what I need. Did I miss something?
Lan
On Thu, Oct 1, 20
Hi, there
Here is the problem I ran into when executing a Spark job (Spark 1.3). The
Spark job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0
library. Then it does some filter/map transformations, repartitions to 1
partition, and then writes to HDFS. It creates 2 stages. The total
5.
Is my understanding correct? In this case, I think repartition is a better
choice than coalesce.
Lan
I am happy to report that after setting spark.driver.userClassPathFirst, I can use
protobuf 3 with spark-shell. It looks like the classloading issue was in the driver,
not the executor.
Marcelo, thank you very much for the tip!
Lan
> On Sep 15, 2015, at 1:40 PM, Marcelo Vanzin <van...@cloudera.com&
that and it did not work either.
Lan
> On Sep 15, 2015, at 10:31 AM, java8964 <java8...@hotmail.com> wrote:
>
> If you use Standalone mode, just start spark-shell like following:
>
> spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true
>
>
is how to
configure spark-shell to use my uber jar first.
java8964 -- I appreciate the link and will try the configuration. It looks
promising. However, the "user classpath first" attribute does not apply to
spark-shell, am I correct?
Lan
On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8..
Hi, there
I ran into a problem when trying to pass an external jar file to spark-shell.
I have an uber jar file that contains all the Java code I created for protobuf
and all its dependencies.
If I simply execute my code using the Scala shell, it works fine without error. I
use -cp to pass the
and extend the question to any third-party libraries. How do we deal with
version conflicts for any third-party libraries included in the Spark
distribution?
Thanks!
Lan
-library is provided; you need to change
it to compile to run SparkPi in IntelliJ. As I remember, you also need to
change the guava and jetty related libraries to compile too.
On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan zombiexco...@gmail.com
wrote:
Hi,
I am trying to run SparkPi in Intellij
Hi,
I am trying to run SparkPi in Intellij and getting NoClassDefFoundError.
Anyone else saw this issue before ?
Exception in thread main java.lang.NoClassDefFoundError:
scala/collection/Seq
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue.
I saw https://issues.apache.org/jira/browse/SPARK-6388,
but it is marked there as not a problem.
On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan zombiexco...@gmail.com wrote:
Hi Expert,
Hadoop version: 2.4
Spark version: 1.3.1
I am running
Hi Expert,
Hadoop version: 2.4
Spark version: 1.3.1
I am running the SparkPi example application.
bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-client --executor-memory 2G lib/spark-examples-1.3.1-hadoop2.4.0.jar
2
The same command sometimes gets WARN
...@cloudera.com
wrote:
Awesome!
It's documented here:
https://spark.apache.org/docs/latest/submitting-applications.html
-Sandy
On Mon, May 18, 2015 at 8:03 PM, xiaohe lan zombiexco...@gmail.com
wrote:
Hi Sandy,
Thanks for your information. Yes, spark-submit --master yarn
--num-executors 5
, Sandy Ryza sandy.r...@cloudera.com
wrote:
Hi Xiaohe,
All the Spark options must go before the jar or they won't take effect.
-Sandy
On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com
wrote:
Sorry, both of them are actually assigned tasks.
Aggregated Metrics by Executor
(per-executor task time, input size, and shuffle read/write totals pasted from the Spark UI)
On Sun, May 17, 2015 at 11:50 PM, xiaohe lan zombiexco...@gmail.com wrote:
bash-4.1$ ps aux | grep SparkSubmit
xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13
/scratch/xilan/jdk1.8.0_45/bin
:
Did you try --executor-cores param? While you submit the job, do a ps aux
| grep spark-submit and see the exact command parameters.
Thanks
Best Regards
On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com
wrote:
Hi,
I have a 5 nodes yarn cluster, I used spark-submit
link:
http://mbonaci.github.io/mbo-spark/
You don't need to install Spark on every node. Just install it on one node,
or you can install it on a remote system and make a Spark cluster.
Thanks
Madhvi
On Thursday 30 April 2015 09:31 AM, xiaohe lan wrote:
Hi experts,
I see spark on yarn has
Hi experts,
I see Spark on YARN has yarn-client and yarn-cluster modes. I also have a
5-node Hadoop cluster (Hadoop 2.4). How do I install Spark if I want to try
the Spark on YARN mode?
Do I need to install Spark on each node of the Hadoop cluster?
Thanks,
Xiaohe
The YARN capacity scheduler supports hierarchical queues, to which you can assign
cluster resources as percentages. Your Spark application/shell can be
submitted to different queues. Mesos supports a fine-grained mode, which
allows the machines/cores used by each executor to ramp up and down.
Lan
On Wed, Apr 22
Hello,
I have the above naive question if anyone could help. Why not use a
row-based file format to save row-based DataFrames/RDDs?
Thanks,
Lan
automatically without you
copying them manually.
Lan
On Apr 20, 2015, at 9:26 AM, Michael Ryabtsev michael...@gmail.com wrote:
Hi all,
I need to configure spark executor log4j.properties on a standalone cluster.
It looks like placing the relevant properties file in the spark
configuration
Each application gets its own executor processes, so there should be no
problem running them in parallel.
Lan
On Apr 20, 2015, at 10:25 AM, Michael Ryabtsev michael...@gmail.com wrote:
Hi Lan,
Thanks for the fast response. It could be a solution if it works. I have more
than one log4
Hi Alexey and Daniel,
I'm using Spark 1.2.0 and still having the same error, as described below.
Do you have any news on this? Really appreciate your responses!!!
a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error
is the same if I have 2 workers). They are connected
it is shutting down.
]
More about the setup: each VM has only 4GB RAM, running Ubuntu, using
spark-1.2.0, built for Hadoop 2.6.0.
I have struggled with this error for a few days. Could anyone please tell me
what the problem is and how to fix it?
Thanks,
Lan