unsubscribe

2020-04-16 Thread Jiang, Lan
-- Lan Jiang https://hpi.de/naumann/people/lan-jiang Hasso-Plattner-Institut an der Universität Potsdam Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam Tel +49 331 5509 280

BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-11 Thread Lan Jiang
data, or a MulticlassClassificationEvaluator for multiclass problems*. " https://spark.apache.org/docs/2.1.0/ml-tuning.html Can someone shed some light on the issue? Lan
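A minimal sketch of the two evaluation paths the quoted docs describe, assuming a Spark 2.1 spark-shell session (sc in scope; data and metric names are illustrative):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Binary metrics work on an RDD of (score, label) pairs.
val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0)))
val binMetrics = new BinaryClassificationMetrics(scoreAndLabels)
println(binMetrics.areaUnderROC()) // AreaUnderROC
println(binMetrics.areaUnderPR())  // AreaUnderPR

// For multiclass problems the docs point to this evaluator instead, driven
// off a DataFrame with "prediction" and "label" columns.
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("f1") // also: "accuracy", "weightedPrecision", "weightedRecall"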

Does monotonically_increasing_id generate the same id even when an executor fails or is evicted from memory

2017-02-28 Thread Lan Jiang
241, which is fixed in 2.0. Lan
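The caveat behind the question, as a small sketch (assuming an existing DataFrame df):

import org.apache.spark.sql.functions.monotonically_increasing_id

// The generated id encodes (partition, offset within partition), so it is
// unique per row but not stable: if a failed partition is recomputed or the
// data is repartitioned, rows can receive different ids. Persist or write
// the result out if downstream logic depends on stable ids.
val withId = df.withColumn("id", monotonically_increasing_id())
withId.persist()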

Spark Streaming proactive monitoring

2017-01-23 Thread Lan Jiang
them proactively? For example, if the processing time/scheduling delay exceeds a certain threshold, send an alert to the admin/developer? Lan
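One way to watch these numbers from inside the application is a StreamingListener; a minimal sketch, with the threshold and the alerting hook as placeholders:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayAlertListener(thresholdMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    val delayMs = info.schedulingDelay.getOrElse(0L) + info.processingDelay.getOrElse(0L)
    if (delayMs > thresholdMs) {
      // Replace println with a real alert channel (email, pager, metrics).
      println(s"ALERT: batch ${info.batchTime} took ${delayMs} ms")
    }
  }
}

// Registered on the StreamingContext before start():
// ssc.addStreamingListener(new DelayAlertListener(30000L))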

Spark Yarn executor container memory

2016-08-15 Thread Lan Jiang
My question is why it does not count permgen size and memory used by stack. They are not part of the max heap size. IMHO, YARN executor container memory should be set to: spark.executor.memory + [-XX:MaxPermSize] + number_of_threads * [-Xss] + spark.yarn.executor.memoryOverhead. What did I
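For comparison, what Spark actually requests from YARN is only heap plus overhead; permgen, thread stacks, and other off-heap use are expected to fit inside the overhead. A worked example under the 1.6-era defaults, assuming an 8 GB heap:

container request = spark.executor.memory + spark.yarn.executor.memoryOverhead
                  = 8192 MB + max(384, 0.10 * 8192) MB
                  = 8192 MB + 819 MB = 9011 MB

YARN then rounds the request up to a multiple of yarn.scheduler.minimum-allocation-mb.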

Re: Processing json document

2016-07-07 Thread Lan Jiang
but I am worried if it is a single large file. >> >> In this case, this would only work in a single executor, which I think will >> end up with OutOfMemoryException. >> >> Spark JSON data source does not support multi-line JSON as input due to >> the limitation of TextInpu

Processing json document

2016-07-06 Thread Lan Jiang
Hi, there Spark has provided a json document processing feature for a long time. In most examples I see, each line is a json object in the sample file. That is the easiest case. But how can we process a json document that does not conform to this standard format (one line per json object)? Here
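A common workaround from that era, sketched here under the assumption that each file holds one JSON document small enough for a single task (on Spark 2.2+ the built-in multiLine reader option makes this unnecessary):

// One (path, content) pair per whole file; flatten newlines, then parse.
val docs = sc.wholeTextFiles("hdfs:///data/json-docs")
  .map { case (_, content) => content.replaceAll("\\s*\\n\\s*", " ") }
val df = spark.read.json(docs) // RDD[String] overload; sqlContext.read.json(docs) on 1.x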

Re: MLLib + Streaming

2016-03-06 Thread Lan Jiang
online and offline learning. Lan > On Mar 6, 2016, at 2:43 AM, Chris Miller <cmiller11...@gmail.com> wrote: > > Guru: This is a really great response. Thanks for taking the time to explain > all of this. Helpful for me too. > > > -- > Chris Miller > >

Re: Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Sorry, accidentally sent again. My apologies. > On Mar 6, 2016, at 1:22 PM, Lan Jiang <ljia...@gmail.com> wrote: > > Hi, there > > I hope someone can clarify this for me. It seems that some of the MLlib > algorithms such as KMeans, Linear Regression and Logistic Regres

Spark ML and Streaming

2016-03-06 Thread Lan Jiang
should be able to run in the streaming application. Am I wrong? Thanks in advance. Lan

MLLib + Streaming

2016-03-05 Thread Lan Jiang
should be able to run in the streaming application. Am I wrong? Lan
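For the record, MLlib ships streaming-specific variants (StreamingKMeans, StreamingLinearRegressionWithSGD, StreamingLogisticRegressionWithSGD) that update the model on every micro-batch; a minimal sketch, assuming a DStream[Vector] named trainingStream built elsewhere:

import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)      // 1.0 weights all history equally
  .setRandomCenters(2, 0.0) // 2-dimensional data, zero initial weight
model.trainOn(trainingStream)           // centers update with each batch
model.predictOn(trainingStream).print() // cluster assignments per batch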

broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Hi, there I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold. According to the programming guide, *Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.* My question is: is

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Michael, Thanks for the reply. On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust wrote: > My question is: is the "NOSCAN" option a must? If I execute the "ANALYZE TABLE >> compute statistics" command in the Hive shell, are the statistics >> going to be used by SparkSQL to
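The moving parts, sketched for a Hive-backed table (table names and the threshold value are illustrative; sqlContext here is assumed to be a HiveContext):

// Table-level size statistics from file sizes only, no data scan:
sqlContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")
// Tables whose recorded size falls below the threshold are broadcast:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
val joined = sqlContext.sql("SELECT * FROM facts f JOIN small_dim d ON f.k = d.k")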

Do existing R packages work with SparkR data frames

2015-12-22 Thread Duy Lan Nguyen
Hello, Is it possible for existing R Machine Learning packages (which work with R data frames) such as bnlearn, to work with SparkR data frames? Or do I need to convert SparkR data frames to R data frames? Is "collect" the function to do the conversion, or how else to do that? Many Thanks, Lan

Question about Spark Streaming checkpoint interval

2015-12-18 Thread Lan Jiang
ed. I do not find the answer in the document saying whether metadata checkpointing is done for each batch and whether checkpointinterval setting applies to both types of checkpointing. Maybe I miss it. If anyone can point me to the right documentation, I would highly appreciate it. Best Regards, Lan
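For what it's worth, the behavior as I understand it (worth verifying against the docs): metadata checkpointing happens every batch once a checkpoint directory is set, while data checkpointing of stateful streams defaults to a multiple of the batch interval and can be tuned per DStream. A sketch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("ckpt-demo"), Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/ckpt-demo") // enables metadata + data checkpoints

val counts = ssc.socketTextStream("localhost", 9999)
  .map((_, 1L))
  .updateStateByKey[Long]((vals: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + vals.sum))
counts.checkpoint(Seconds(50)) // data checkpoint every 5th batch
counts.print()
ssc.start()
ssc.awaitTermination()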

Re: Scala VS Java VS Python

2015-12-16 Thread Lan Jiang
does not have a REPL shell, which is a major drawback from my perspective. Lan > On Dec 16, 2015, at 3:46 PM, Stephen Boesch <java...@gmail.com> wrote: > > There are solid reasons to have built spark on the jvm vs python. The > question for Daniel appears to be at th

Re: Protobuff 3.0 for Spark

2015-11-09 Thread Lan Jiang
I have not run into any linkage problem, but maybe I was lucky. :-). The reason I wanted to use protobuf 3 is mainly for Map type support. On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran <ste...@hortonworks.com> wrote: > > > On 5 Nov 2015, at 00:12, Lan Jiang <ljia...@gmail.com

Re: Protobuff 3.0 for Spark

2015-11-04 Thread Lan Jiang
protobuf 3 jar file either through --jars during the spark-submit or package it into an uber jar file with your own classes. Lan > On Nov 4, 2015, at 4:07 PM, Cassa L <lcas...@gmail.com> wrote: > > Hi, > Does spark support protobuff 3.0? I used protobuff 2.5 with spark-1.4 bu

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
is splittable and will not create so many partitions. Lan > On Oct 20, 2015, at 8:03 AM, François Pelletier > <newslett...@francoispelletier.org> wrote: > > You should aggregate your files into larger chunks before doing anything else. > HDFS is not fit for small files. It wi

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
I think the data file is binary per the original post. So in this case, sc.binaryFiles should be used. However, I still recommend against using so many small binary files as 1. They are not good for batch I/O 2. They put too much memory pressure on the namenode. Lan > On Oct 20, 2015, at 11
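A sketch of that approach (path and partition count illustrative; assumes each file fits comfortably in one record):

// One record per file: (path, PortableDataStream).
val files = sc.binaryFiles("hdfs:///data/small-binaries")
// Coalesce early so later stages don't run one tiny task per file.
val sizes = files.coalesce(200).map { case (path, stream) =>
  val bytes = stream.toArray() // materializes the file's contents
  (path, bytes.length)
}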

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-14 Thread Lan Jiang
this is the problem. After I increased the spark.yarn.executor.memoryOverhead, it was working fine. I was using Spark 1.3, which has the default value of executorMemory * 0.07, with a minimum of 384. In Spark 1.4 and later, the default value was changed to executorMemory * 0.10, with a minimum of 384. Lan

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
Hmm, that’s odd. You can always use repartition(n) to increase the partition number, but then there will be shuffle. How large is your ORC file? Have you used NameNode UI to check how many HDFS blocks each ORC file has? Lan > On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gma
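A quick way to check the count and then widen the parallelism, as a sketch (path and target count illustrative):

val df = sqlContext.read.format("orc").load("hdfs:///warehouse/my_table")
println(df.rdd.partitions.length) // typically about one partition per HDFS block
val wider = df.repartition(200)   // incurs a shuffle but yields 200 partitions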

failed spark job reports on YARN as successful

2015-10-08 Thread Lan Jiang
still a bug, or is there something I need to do in the Spark application to report the correct job status to YARN? Lan

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should be the same as the HDFS block number instead of the file number. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size? Lan > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote: >

Spark cache memory storage

2015-10-06 Thread Lan Jiang
to be 6g. Thus I expect the memory cache to be 6 * 0.9 * 0.6 = 3.24g. However, on the Spark history server, it shows the reserved cache size for each executor is 3.1g. So it does not add up. What am I missing? Lan
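A plausible reconciliation, assuming Spark 1.x sizes its storage region off Runtime.getRuntime.maxMemory, which the JVM reports as somewhat less than -Xmx because one survivor space is excluded:

configured heap  = 6g = 6144 MB (-Xmx)
JVM maxMemory    ~ 5900 MB (survivor space excluded)
storage region   ~ 5900 MB * 0.9 * 0.6 ~ 3186 MB ~ 3.1g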

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-05 Thread Lan Jiang
, write the result to HDFS. I use spark 1.3 with spark-avro (1.0.0). The error only happens when running on the whole dataset. When running on 1/3 of the files, the same job completes without error. On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang <ljia...@gmail.com> wrote: > Hi, there

How to access lost executor log file

2015-10-01 Thread Lan Jiang
executors to find out why they were lost? Thanks Lan

Re: How to access lost executor log file

2015-10-01 Thread Lan Jiang
g URL" in the application overview section. When I click it, it brings me to the Spark history server UI, where I cannot find the lost executors. The only logs link I can find on the YARN RM site is the ApplicationMaster log, which is not what I need. Did I miss something? Lan On Thu, Oct 1, 20

"java.io.IOException: Filesystem closed" on executors

2015-10-01 Thread Lan Jiang
Hi, there Here is the problem I ran into when executing a Spark Job (Spark 1.3). The spark job is loading a bunch of avro files using Spark SQL spark-avro 1.0.0 library. Then it does some filter/map transformation, repartition to 1 partition and then write to HDFS. It creates 2 stages. The total

unintended consequence of using coalesce operation

2015-09-29 Thread Lan Jiang
5. Is my understanding correct? In this case, I think repartition is a better choice than coalesce. Lan
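The distinction in code, as a sketch (bigRdd and expensiveTransform are placeholders):

// coalesce(5) is a narrow dependency: the whole upstream stage collapses
// to 5 tasks, so the expensive map also runs with parallelism 5.
val slow = bigRdd.map(expensiveTransform).coalesce(5)

// repartition(5) is coalesce(5, shuffle = true): the map keeps the full
// parallelism and only the shuffle boundary narrows to 5 partitions.
val fast = bigRdd.map(expensiveTransform).repartition(5)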

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
I am happy to report that after setting spark.driver.userClassPathFirst, I can use protobuf 3 with spark-shell. Looks like the classloading issue is in the driver, not the executor. Marcelo, thank you very much for the tip! Lan > On Sep 15, 2015, at 1:40 PM, Marcelo Vanzin <van...@cloudera.com&
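The settings in question, sketched as conf entries (names per the Spark docs; for spark-shell they must be supplied at launch, e.g. via --conf, because the driver JVM is already running by the time user code executes):

import org.apache.spark.SparkConf

// Prefer the application's jars (e.g. protobuf 3) over Spark's bundled copies:
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")   // driver-side classloading
  .set("spark.executor.userClassPathFirst", "true") // executor-side classloading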

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
that and it did not work either. Lan > On Sep 15, 2015, at 10:31 AM, java8964 <java8...@hotmail.com> wrote: > > If you use Standalone mode, just start spark-shell like following: > > spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true > >

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
is how to configure spark shell to use my uber jar first. java8964 -- appreciate the link and I will try the configuration. Looks promising. However, the "user classpath first" attribute does not apply to spark-shell, am I correct? Lan On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8..

add external jar file to Spark shell vs. Scala Shell

2015-09-14 Thread Lan Jiang
Hi, there I ran into a problem when I try to pass an external jar file to spark-shell. I have an uber jar file that contains all the java code I created for protobuf and all its dependencies. If I simply execute my code using the Scala Shell, it works fine without error. I use -cp to pass the

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
and extend the question to any third party libraries. How do I deal with version conflicts for any third party libraries included in the Spark distribution? Thanks! Lan

Re: SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-17 Thread xiaohe lan
-library is provided, you need to change it to compile to run SparkPi in IntelliJ. As I remember, you also need to change the guava and jetty related libraries to compile too. On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I am trying to run SparkPi in IntelliJ
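In sbt terms the fix looks like this sketch (version illustrative); Maven users would similarly drop <scope>provided</scope>:

// build.sbt: keep Spark on the compile classpath so the IDE can run SparkPi.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" // not % "provided"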

SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-16 Thread xiaohe lan
Hi, I am trying to run SparkPi in IntelliJ and getting NoClassDefFoundError. Has anyone else seen this issue before? Exception in thread main java.lang.NoClassDefFoundError: scala/collection/Seq at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at

Re: getting WARN ReliableDeliverySupervisor

2015-07-02 Thread xiaohe lan
Changing the JDK from 1.8.0_45 to 1.7.0_79 solved this issue. I saw https://issues.apache.org/jira/browse/SPARK-6388, but it is resolved as not a problem. On Thu, Jul 2, 2015 at 1:30 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi Expert, Hadoop version: 2.4 Spark version: 1.3.1 I am running

getting WARN ReliableDeliverySupervisor

2015-07-01 Thread xiaohe lan
Hi Expert, Hadoop version: 2.4 Spark version: 1.3.1 I am running the SparkPi example application. bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 2G lib/spark-examples-1.3.1-hadoop2.4.0.jar 2 The same command sometimes gets WARN

Re: number of executors

2015-05-18 Thread xiaohe lan
...@cloudera.com wrote: Awesome! It's documented here: https://spark.apache.org/docs/latest/submitting-applications.html -Sandy On Mon, May 18, 2015 at 8:03 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi Sandy, Thanks for your information. Yes, spark-submit --master yarn --num-executors 5

Re: number of executors

2015-05-18 Thread xiaohe lan
, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Xiaohe, All Spark options must go before the jar or they won't take effect. -Sandy On Sun, May 17, 2015 at 8:59 AM, xiaohe lan zombiexco...@gmail.com wrote: Sorry, both of them are assigned tasks actually. Aggregated Metrics by Executor
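In other words (illustrative command; everything after the application jar is treated as arguments to the application itself, not to Spark):

spark-submit --master yarn --num-executors 5 --executor-cores 4 \
  --class your.pkg.Main yourapp.jar arg1 arg2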

Re: number of executors

2015-05-17 Thread xiaohe lan
[aggregated metrics-by-executor table pasted from the Spark UI; garbled in the archive] On Sun, May 17, 2015 at 11:50 PM, xiaohe lan zombiexco...@gmail.com wrote: bash-4.1$ ps aux | grep SparkSubmit xilan 1704 13.2 1.2 5275520 380244 pts/0 Sl+ 08:39 0:13 /scratch/xilan/jdk1.8.0_45/bin

Re: number of executors

2015-05-17 Thread xiaohe lan
: Did you try --executor-cores param? While you submit the job, do a ps aux | grep spark-submit and see the exact command parameters. Thanks Best Regards On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I have a 5 nodes yarn cluster, I used spark-submit

Re: How to install spark in spark on yarn mode

2015-04-30 Thread xiaohe lan
link: http://mbonaci.github.io/mbo-spark/ You don't need to install Spark on every node. Just install it on one node, or you can install it on a remote system and make a Spark cluster. Thanks Madhvi On Thursday 30 April 2015 09:31 AM, xiaohe lan wrote: Hi experts, I see spark on yarn has

How to install spark in spark on yarn mode

2015-04-29 Thread xiaohe lan
Hi experts, I see Spark on YARN has yarn-client and yarn-cluster modes. I also have a 5-node hadoop cluster (hadoop 2.4). How do I install Spark if I want to try the Spark on YARN mode? Do I need to install Spark on each node of the hadoop cluster? Thanks, Xiaohe

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
The YARN capacity scheduler supports hierarchical queues, to which you can assign cluster resources as percentages. Your Spark application/shell can be submitted to different queues. Mesos supports a fine-grained mode, which allows the machines/cores used by each executor to ramp up and down. Lan On Wed, Apr 22
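For the YARN route, the queue is a submit-time flag (queue name illustrative):

spark-submit --master yarn --queue analytics --class your.pkg.Main yourapp.jar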

Why is Columnar Parquet used as default for saving Row-based DataFrames/RDD?

2015-04-20 Thread Duy Lan Nguyen
Hello, I have the above naive question if anyone could help. Why not use a row-based file format to save row-based DataFrames/RDDs? Thanks, Lan

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
automatically without you copying them manually. Lan On Apr 20, 2015, at 9:26 AM, Michael Ryabtsev michael...@gmail.com wrote: Hi all, I need to configure spark executor log4j.properties on a standalone cluster. It looks like placing the relevant properties file in the spark configuration
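A common way to ship a per-application log4j config, sketched with an illustrative file name (--files places the file in each executor's working directory, where the relative -Dlog4j.configuration path resolves):

spark-submit --files my-log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-log4j.properties" \
  --class your.pkg.Main yourapp.jar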

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
Each application gets its own executor processes, so there should be no problem running them in parallel. Lan On Apr 20, 2015, at 10:25 AM, Michael Ryabtsev michael...@gmail.com wrote: Hi Lan, Thanks for the fast response. It could be a solution if it works. I have more than one log4

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2015-02-11 Thread Lan
Hi Alexey and Daniel, I'm using Spark 1.2.0 and still having the same error, as described below. Do you have any news on this? Really appreciate your responses!!! a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error is the same if I have 2 workers). They are connected

ERROR EndpointWriter: AssociationError

2015-02-07 Thread Lan
it is shutting down. ] More about the setup: each VM has only 4GB RAM, running Ubuntu, using spark-1.2.0, built for Hadoop 2.6.0. I have struggled with this error for a few days. Could anyone please tell me what the problem is and how to fix it? Thanks, Lan