BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-11 Thread Lan Jiang
I realized that in Spark ML, BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC. Why is that? What if I need other metrics such as F-score or accuracy? I tried to use MulticlassClassificationEvaluator to evaluate other metrics such as Accuracy for a binary classification
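
One workaround, sketched below under the assumption of a Spark 2.x DataFrame named `predictions` holding the "label" and "prediction" columns of a fitted binary classifier, is to point MulticlassClassificationEvaluator at the binary output, since accuracy and F1 reduce cleanly to the two-class case:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    // Sketch only: `predictions` stands for the output DataFrame of a fitted
    // binary classifier, with "label" and "prediction" columns.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")

    val accuracy = evaluator.setMetricName("accuracy").evaluate(predictions)
    val f1       = evaluator.setMetricName("f1").evaluate(predictions)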

Does monotonically_increasing_id generate the same id even when an executor fails or is evicted from memory

2017-02-28 Thread Lan Jiang
Hi, there I am trying to generate a unique ID for each record in a dataframe, so that I can save the dataframe to a relational database table. My question is: when the dataframe is regenerated due to executor failure or being evicted from cache, do the IDs stay the same as before?
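
For context, a minimal sketch of the pattern being asked about (the DataFrame `df`, the JDBC URL and the table name are placeholders); the generated IDs are unique per row but, as the question suspects, they are not guaranteed to stay identical if the lineage is recomputed, so persisting before the write is one way to pin them down:

    import java.util.Properties
    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Sketch only: `df` stands for the source DataFrame; JDBC details are placeholders.
    val withId = df.withColumn("id", monotonically_increasing_id()).persist()

    val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"   // placeholder
    val props = new Properties()
    props.setProperty("user", "spark")                   // placeholder
    withId.write.jdbc(jdbcUrl, "target_table", props)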

Spark Streaming proactive monitoring

2017-01-23 Thread Lan Jiang
Hi, there From the Spark UI, we can monitor the following two metrics: • Processing Time - The time to process each batch of data. • Scheduling Delay - the time a batch waits in a queue for the processing of previous batches to finish. However, what is the best way to monitor
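
One proactive option, sketched here under the assumption of an existing StreamingContext `ssc` and an external alerting hook of your choice, is to register a StreamingListener and watch the same per-batch metrics programmatically rather than through the UI:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Sketch only: `ssc` is the application's StreamingContext; the println is a
    // stand-in for a real alerting hook, and the thresholds are illustrative.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val schedulingMs = info.schedulingDelay.getOrElse(0L)
        val processingMs = info.processingDelay.getOrElse(0L)
        if (schedulingMs > 10000L || processingMs > 30000L) {
          println(s"Batch ${info.batchTime}: scheduling delay ${schedulingMs} ms, " +
            s"processing time ${processingMs} ms")
        }
      }
    })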

Spark Yarn executor container memory

2016-08-15 Thread Lan Jiang
Hello, My understanding is that YARN executor container memory is based on "spark.executor.memory" + "spark.yarn.executor.memoryOverhead". The first one is for heap memory and the second one is for off-heap memory. The spark.executor.memory value is passed as -Xmx to set the max heap size. Now my question
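
For reference, a hedged illustration of how the two settings combine (the 8g/1g values are made up): the container YARN allocates per executor is roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead, the latter defaulting to a fraction of the executor memory with a 384 MB floor:

    import org.apache.spark.SparkConf

    // Illustration only; the 8g / 1g values are made up. The YARN container size
    // requested per executor is roughly
    //   spark.executor.memory (heap, becomes -Xmx) + spark.yarn.executor.memoryOverhead (off-heap)
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")                 // heap: -Xmx8g
      .set("spark.yarn.executor.memoryOverhead", "1024")  // off-heap, in MB
    // => a container request of roughly 8 GB + 1 GB = 9 GB per executor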

Re: Processing json document

2016-07-07 Thread Lan Jiang
but I am worried if it is single large file. >> >> In this case, this would only work in single executor which I think will >> end up with OutOfMemoryException. >> >> Spark JSON data source does not support multi-line JSON as input due to >> the limitation of TextInpu

Processing json document

2016-07-06 Thread Lan Jiang
Hi, there Spark has provided a JSON processing feature for a long time. In most examples I see, each line in the sample file is a JSON object. That is the easiest case. But how can we process a JSON document that does not conform to this standard format (one line per JSON object)? Here
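
One approach on the Spark 1.x API, sketched below with a placeholder path, is to read each multi-line document whole with wholeTextFiles and hand the resulting strings to the JSON reader; the caveat (raised later in this thread) is that every document must fit in a single executor's memory:

    // Placeholder path; `sc` and `sqlContext` are the usual shell variables.
    // Each file is read whole as a single (path, content) string pair.
    val jsonRdd = sc.wholeTextFiles("hdfs:///path/to/json-docs/").map(_._2)
    val df = sqlContext.read.json(jsonRdd)
    df.printSchema()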

Re: MLLib + Streaming

2016-03-06 Thread Lan Jiang
p where as online training can be hard and needs to be > constantly monitored to see how it is performing. > > Hope this helps in understanding offline learning vs. online learning and > which algorithms you can choose for online learning in MLlib. > > Guru Medasani > gdm...@

Re: Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Sorry, accidentally sent again. My apology. > On Mar 6, 2016, at 1:22 PM, Lan Jiang <ljia...@gmail.com> wrote: > > Hi, there > > I hope someone can clarify this for me. It seems that some of the MLlib > algorithms such as KMean, Linear Regression and Logistics Regres

Spark ML and Streaming

2016-03-06 Thread Lan Jiang
Hi, there I hope someone can clarify this for me. It seems that some of the MLlib algorithms such as KMeans, Linear Regression and Logistic Regression have a Streaming version, which can do online machine learning. But does that mean other MLlib algorithms cannot be used in Spark Streaming

MLLib + Streaming

2016-03-05 Thread Lan Jiang
Hi, there I hope someone can clarify this for me. It seems that some of the MLlib algorithms such as KMeans, Linear Regression and Logistic Regression have a Streaming version, which can do online machine learning. But does that mean other MLlib algorithms cannot be used in Spark Streaming
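
For what it's worth, a hedged sketch of one common pattern: algorithms without a Streaming* variant can still be used in Spark Streaming by training offline and only scoring each micro-batch. Here `sc`, `featureStream` (an assumed DStream of mllib Vectors) and the model path are placeholders:

    import org.apache.spark.mllib.classification.LogisticRegressionModel

    // Sketch only: the model is trained and saved offline; the stream just scores it.
    // `featureStream` is a DStream[org.apache.spark.mllib.linalg.Vector].
    val model = LogisticRegressionModel.load(sc, "hdfs:///models/lr-model")
    val predictions = featureStream.transform(rdd => model.predict(rdd))
    predictions.print()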

broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Hi, there I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold. According to the programming guide *Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.* My question is that is
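
For context, two hedged options (table and DataFrame names below are placeholders): let Hive statistics drive the planner under spark.sql.autoBroadcastJoinThreshold, or bypass the statistics requirement entirely with an explicit broadcast hint:

    import org.apache.spark.sql.functions.broadcast

    // Option 1: let Hive statistics drive the planner (table name is a placeholder).
    sqlContext.sql("ANALYZE TABLE small_dim_table COMPUTE STATISTICS noscan")

    // Option 2: skip the statistics requirement with an explicit broadcast hint
    // (largeDf / smallDf / join_key are placeholders).
    val joined = largeDf.join(broadcast(smallDf), Seq("join_key"))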

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Michael, Thanks for the reply. On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust wrote: > My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE >> compute statistics" command in Hive shell, is the statistics >> going to be used by SparkSQL to

Question about Spark Streaming checkpoint interval

2015-12-18 Thread Lan Jiang
Need some clarification about the documentation. According to Spark doc "the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream
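
A small illustration of the guidance being quoted (the interval values are only an example): with a 10-second batch interval, checkpointing a stateful DStream every 50 seconds, i.e. 5 sliding intervals, keeps the lineage short without paying the checkpoint cost on every batch:

    import org.apache.spark.streaming.Seconds

    // `stateDstream` is a placeholder stateful DStream; batch interval assumed 10 s.
    stateDstream.checkpoint(Seconds(50))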

Re: Scala VS Java VS Python

2015-12-16 Thread Lan Jiang
For a Spark data science project, Python might be a good choice. However, for Spark Streaming, the Python API is still lagging. For example, for the receiver-less Kafka (direct) connector, according to the Spark 1.5.2 doc: "Spark 1.4 added a Python API, but it is not yet at full feature parity". Java does not

Re: Protobuff 3.0 for Spark

2015-11-09 Thread Lan Jiang
I have not run into any linkage problem, but maybe I was lucky. :-). The reason I wanted to use protobuf 3 is mainly for Map type support. On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran <ste...@hortonworks.com> wrote: > > > On 5 Nov 2015, at 00:12, Lan Jiang <ljia...@gmail.com

Re: Protobuff 3.0 for Spark

2015-11-04 Thread Lan Jiang
I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to do the following: 1. Set the parameters spark.executor.userClassPathFirst=true and spark.driver.userClassPathFirst=true. 2. Include
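
A minimal sketch of step 1 expressed as SparkConf settings (the same keys can equally be passed as --conf flags to spark-submit):

    import org.apache.spark.SparkConf

    // Prefer classes from the application jar over the ones shipped with Spark/Hadoop.
    val conf = new SparkConf()
      .set("spark.executor.userClassPathFirst", "true")
      .set("spark.driver.userClassPathFirst", "true")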

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
As Francois pointed out, you are encountering a classic small file anti-pattern. One solution I used in the past is to wrap all these small binary files into a sequence file or avro file. For example, the avro schema can have two fields: filename: string and binaryname:byte[]. Thus your file is
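
A hedged sketch of the packing idea (paths are placeholders): read every small file as raw bytes and save the (filename, bytes) pairs as one SequenceFile, so downstream jobs see a few large splits instead of tens of thousands of tiny ones:

    // Each small file becomes one (filename, bytes) record in a single SequenceFile.
    val packed = sc.binaryFiles("hdfs:///raw/small-files/")
      .map { case (path, stream) => (path, stream.toArray()) }
    packed.saveAsSequenceFile("hdfs:///packed/small-files-seq")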

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
s/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)> > > On 20 October 2015 at 15:04, Lan Jiang <ljia...@gmail.com > <mailto:ljia...@gmail.com>> wrote: > As Francois pointed out, you are encountering a classic small file > anti-patt

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-14 Thread Lan Jiang
On Mon, Oct 12, 2015 at 8:34 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote: > Can you look a bit deeper in the executor logs? It could be filling up the > memory and getting killed. > > Thanks > Best Regards > > On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang <ljia...@gma

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
il.com> wrote: > > Hi Lan, thanks for the response yes I know and I have confirmed in UI that it > has only 12 partitions because of 12 HDFS blocks and hive orc file strip size > is 33554432. > > On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com >

failed spark job reports on YARN as successful

2015-10-08 Thread Lan Jiang
Hi, there I have a Spark batch job running on CDH 5.4 + Spark 1.3.0. The job is submitted in "yarn-client" mode. The job itself failed because YARN killed several executor containers that exceeded the memory limit imposed by YARN. However, when I went to the YARN resource manager

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should be the same as the HDFS block count, not the file count. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size? Lan > On Oct 8, 2015, at 1:13 PM, unk1102 wrote: > > Hi I have the
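
If the block layout really does yield too few partitions, a simple follow-up (sketch only, with an arbitrary target of 48 and `df` standing for the ORC-backed DataFrame) is to redistribute it explicitly after the read:

    val repartitioned = df.repartition(48)
    println(repartitioned.rdd.partitions.length)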

Spark cache memory storage

2015-10-06 Thread Lan Jiang
Hi, there My understanding is that the cache storage is calculated as follows: executor heap size * spark.storage.safetyFraction * spark.storage.memoryFraction. The default value for safetyFraction is 0.9 and for memoryFraction is 0.6. When I started a Spark job on YARN, I set executor-memory to
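
A worked example of that formula, using the defaults quoted above and an illustrative 10 GB heap:

    // Worked example (10 GB is illustrative, fractions are the quoted defaults):
    val executorHeapGb  = 10.0   // --executor-memory 10g
    val safetyFraction  = 0.9    // spark.storage.safetyFraction default
    val memoryFraction  = 0.6    // spark.storage.memoryFraction default
    val cacheCapacityGb = executorHeapGb * safetyFraction * memoryFraction  // ≈ 5.4 GB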

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-05 Thread Lan Jiang
, write the result to HDFS. I use spark 1.3 with spark-avro (1.0.0). The error only happens when running on the whole dataset. When running on 1/3 of the files, the same job completes without error. On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang <ljia...@gmail.com> wrote: > Hi, there

How to access lost executor log file

2015-10-01 Thread Lan Jiang
Hi, there When running a Spark job on YARN, 2 executors somehow got lost during the execution. The message on the history server GUI is “CANNOT find address”. Two extra executors were launched by YARN and eventually finished the job. Usually I go to the “Executors” tab on the UI to check the

Re: How to access lost executor log file

2015-10-01 Thread Lan Jiang
15 at 1:30 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Can you go to YARN RM UI to find all the attempts for this Spark Job ? > > The two lost executors should be found there. > > On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang <ljia...@gmail.com> wrote: > >> Hi, t

"java.io.IOException: Filesystem closed" on executors

2015-10-01 Thread Lan Jiang
Hi, there Here is the problem I ran into when executing a Spark job (Spark 1.3). The job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0 library. Then it does some filter/map transformations, repartitions to 1 partition, and writes to HDFS. It creates 2 stages. The total

unintended consequence of using coalesce operation

2015-09-29 Thread Lan Jiang
Hi, there I ran into an issue when using Spark (v 1.3) to load Avro files through Spark SQL. The code sample is below: val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro") val myrdd = df.select("Key", "Name", "binaryfield").rdd val results = myrdd.map(...) val finalResults =
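
For context, the remedy usually discussed for this pattern, sketched below: coalesce(1) without a shuffle collapses the upstream map work into a single task, whereas forcing a shuffle (or simply using repartition(1)) keeps the earlier transformations parallel and only funnels the final output into one partition:

    // `results` is the RDD built above; repartition(1) is shorthand for
    // coalesce(1, shuffle = true), which inserts a shuffle boundary so the
    // map stage keeps its original parallelism.
    val finalResults = results.repartition(1)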

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
using spark-submit. > > If that doesn't work, I'd recommend shading, as others already have. > > On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang <ljia...@gmail.com> wrote: >> I used the --conf spark.files.userClassPathFirst=true in the spark-shell >> option, it still ga

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
996> > > Yong > > Subject: Re: Change protobuf version or any other third party library version > in Spark application > From: ste...@hortonworks.com <mailto:ste...@hortonworks.com> > To: ljia...@gmail.com <mailto:ljia...@gmail.com> > CC: user@spark.apache.org <mailto:u

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
ste...@hortonworks.com > To: ljia...@gmail.com > CC: user@spark.apache.org > Date: Tue, 15 Sep 2015 09:19:28 + > > > > > On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com> wrote: > > Hi, there, > > I am using Spark 1.4.1. The protobuf 2.5 is included b

add external jar file to Spark shell vs. Scala Shell

2015-09-14 Thread Lan Jiang
Hi, there I ran into a problem when I tried to pass an external jar file to spark-shell. I have an uber jar file that contains all the Java code I created for protobuf and all its dependencies. If I simply execute my code using the Scala shell, it works fine without error. I use -cp to pass the

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
Hi, there, I am using Spark 1.4.1. Protobuf 2.5 is included by Spark 1.4.1 by default. However, I would like to use protobuf 3 in my Spark application so that I can use some new features such as Map support. Is there any way to do that? Right now if I build an uber.jar with dependencies

Re: Scheduling across applications - Need suggestion

2015-04-22 Thread Lan Jiang
The YARN capacity scheduler supports hierarchical queues, to which you can assign cluster resources as percentages. Your Spark application/shell can be submitted to different queues. Mesos supports fine-grained mode, which allows the machines/cores used by each executor to ramp up and down. Lan On Wed, Apr 22,

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
Rename your log4j_special.properties file to log4j.properties and place it under the root of your jar file, and you should be fine. If you are using Maven to build your jar, place the log4j.properties in the src/main/resources folder. However, please note that if you have other dependency jar

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
that going this way I will not be able to run several applications on the same cluster in parallel? Regarding submit, I am not using it now, I submit from the code, but I think I should consider this option. Thanks. On Mon, Apr 20, 2015 at 5:59 PM, Lan Jiang ljia...@gmail.com