Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
Hi Pavan, Per the ASF Source Header and Copyright Notice Policy [1], code submitted directly to the ASF should include the Apache license header without any additional copyright notice. Kent Yao [1] https://www.apache.org/legal/src-headers.html#headers Sean Owen wrote on Tue, Jul 25, 2023 at 07:22:
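For reference, the header the policy prescribes is the plain ASF license block with no per-file copyright line; at the top of a Scala source file it looks like this:

    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *    http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
     * implied.  See the License for the specific language governing
     * permissions and limitations under the License.
     */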

[ANNOUNCE] Apache Kyuubi (Incubating) released 1.5.0-incubating

2022-03-25 Thread Kent Yao
Hi all, The Apache Kyuubi (Incubating) community is pleased to announce that Apache Kyuubi (Incubating) 1.5.0-incubating has been released! Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark and

Re: Spark version verification

2021-03-21 Thread Kent Yao
https://github.com/apache/spark/tags Bests, Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp., a spark enthusiast. kyuubi is a unified multi-tenant JDBC interf

Re: Spark version verification

2021-03-21 Thread Kent Yao
Please refer to http://spark.apache.org/docs/latest/api/sql/index.html#version  Kent Yao @ Data Science Center, Hangzhou Research Institute
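For the archive, a minimal sketch of that built-in SQL function (available in Spark 3.x), which returns the release version and build commit:

    spark.sql("SELECT version()").show(truncate = false)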

Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-10 Thread Kent Yao
Hi Pankaj, Have you tried spark.sql.parquet.respectSummaryFiles=true? Bests, Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase
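A hedged sketch of applying that setting from a session (it can equally be passed with --conf at submit time):

    // Read the _metadata/_common_metadata summary files written next to
    // Parquet output instead of scanning every part-file footer.
    spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")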

Re:spark 3.1.1 support hive 1.2

2021-03-09 Thread Kent Yao
, Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp., a spark enthusiast. kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark. spark-authorizer: A Spark

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Kent Yao
Congrats, all! Bests, Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp., a spark

unsubscribe

2020-03-07 Thread haitao .yao
unsubscribe -- haitao.yao

Re: I have trained a ML model, now what?

2019-01-23 Thread Pola Yao
Hi Riccardo, Right now, Spark does not support low-latency predictions in production. MLeap is an alternative, and it's been used in many scenarios. But it's good to see that the Spark community has decided to provide such support. On Wed, Jan 23, 2019 at 7:53 AM Riccardo Ferrari wrote: > Felix,

Re: How to force-quit a Spark application?

2019-01-22 Thread Pola Yao
anager" > thread, but I don't see that one in your list. > > On Wed, Jan 16, 2019 at 12:08 PM Pola Yao wrote: > > > > Hi Marcelo, > > > > Thanks for your response. > > > > I have dumped the threads on the server where I submitted the spark > applica

Re: How to force-quit a Spark application?

2019-01-16 Thread Pola Yao
AM Marcelo Vanzin wrote: > If System.exit() doesn't work, you may have a bigger problem > somewhere. Check your threads (using e.g. jstack) to see what's going > on. > > On Wed, Jan 16, 2019 at 8:09 AM Pola Yao wrote: > > > > Hi Marcelo, > > > > Thanks for

Re: How to force-quit a Spark application?

2019-01-16 Thread Pola Yao
if > something is creating a non-daemon thread that stays alive somewhere, > you'll see that. > > Or you can force quit with sys.exit. > > On Tue, Jan 15, 2019 at 1:30 PM Pola Yao wrote: > > > > I submitted a Spark job through ./spark-submit command, the code was > exe

How to force-quit a Spark application?

2019-01-15 Thread Pola Yao
I submitted a Spark job through the ./spark-submit command; the code executed successfully, but the application got stuck when trying to quit Spark. My code snippet: ''' { val spark = SparkSession.builder.master(...).getOrCreate val pool = Executors.newFixedThreadPool(3) implicit val xc =
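A likely culprit, and a minimal sketch of the fix, assuming the snippet above continues along these lines: Executors.newFixedThreadPool creates non-daemon threads, and any live non-daemon thread keeps the JVM from exiting after main() returns, so the pool must be shut down explicitly.

    import java.util.concurrent.Executors
    import scala.concurrent.ExecutionContext
    import org.apache.spark.sql.SparkSession

    object Main {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").getOrCreate()
        val pool = Executors.newFixedThreadPool(3) // non-daemon threads
        implicit val xc: ExecutionContext =
          ExecutionContext.fromExecutorService(pool)
        try {
          // ... submit and await the parallel work ...
        } finally {
          pool.shutdown() // releases the threads so the JVM can exit
          spark.stop()
        }
      }
    }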

[Spark-ml]Error in training ML models: Missing an output location for shuffle xxx

2019-01-07 Thread Pola Yao
Hi Spark Community, I was using XGBoost-spark to train a machine learning model. The dataset was not large (around 1G). And I used the following command to submit my application: ''' ./bin/spark-submit --master yarn --deploy-mode client --num-executors 50 --executor-cores 2 --executor-memory 3g
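This error typically means the executors that held the shuffle map output were lost, often because YARN killed them for exceeding memory limits. A hedged first step is raising the off-heap overhead allowance; the value below is an assumption to tune, and my-app.jar stands in for the real application jar:

    # spark.yarn.executor.memoryOverhead is in MB (the Spark 2.x name)
    ./bin/spark-submit --master yarn --deploy-mode client \
      --num-executors 50 --executor-cores 2 --executor-memory 3g \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      my-app.jar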

[spark-ml] How to write a Spark Application correctly?

2019-01-02 Thread Pola Yao
Hello Spark Community, I have a dataset of size 20G, with 20 columns. Each column is categorical, so I applied string-indexer and one-hot-encoding to every column. Afterwards, I applied vector-assembler on all the newly derived columns to form a feature vector for each record, and then fed the feature
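A minimal sketch of that pipeline (df is an assumed name; on Spark 2.3/2.4 the encoder class is OneHotEncoderEstimator rather than OneHotEncoder):

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // df: the 20-column categorical DataFrame described above (assumed name)
    val cols = df.columns
    val indexers = cols.map(c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
    val encoders = cols.map(c =>
      new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))
    val assembler = new VectorAssembler()
      .setInputCols(cols.map(_ + "_vec")).setOutputCol("features")
    val stages: Array[PipelineStage] = (indexers ++ encoders) :+ assembler
    val featurized = new Pipeline().setStages(stages).fit(df).transform(df)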

Fwd: Train multiple machine learning models in parallel

2018-12-19 Thread Pola Yao
Hi Community, I have a 1T dataset which contains records for 50 users, about 20G of data per user on average. I wanted to use Spark to train a machine learning model (e.g., an XGBoost tree model) for each user. Ideally, the result should be 50 models. However, it'd be infeasible to submit 50 spark jobs
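One common pattern is to run the 50 fits as concurrent jobs inside a single application instead of 50 spark-submit invocations. A sketch, where trainForUser is a hypothetical function that fits one model on one user's slice, and the "user" column name is assumed:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    // Spark's scheduler runs jobs submitted from different threads concurrently.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
    val users = df.select("user").distinct.collect.map(_.getString(0)).toSeq
    val futures = users.map { u =>
      Future(u -> trainForUser(df.filter(df("user") === u)))
    }
    val models = Await.result(Future.sequence(futures), Duration.Inf).toMap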

Re: Cosine Similarity between documents - Rows

2017-11-27 Thread Ge, Yao (Y.)
You are essentially doing document clustering. K-means will do it. You do have to specify the number of clusters up front.
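A minimal sketch, assuming the documents are already vectorized (e.g., TF-IDF) into a "features" column; k = 10 is a placeholder, and that cluster count is exactly what K-means requires up front:

    import org.apache.spark.ml.clustering.KMeans

    // docVectors: DataFrame with a "features" vector column (assumed name)
    val model = new KMeans().setK(10).setSeed(1L).fit(docVectors)
    val clustered = model.transform(docVectors) // adds a "prediction" column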

Re: Getting exit code of pipe()

2017-02-12 Thread Xuchen Yao
an error exit code? > > You could set checkCode to True > spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe > > Otherwise maybe you want to output the status into stdout so you could > process it individually.

Getting exit code of pipe()

2017-02-10 Thread Xuchen Yao
Hello Community, I have the following Python code that calls an external command: rdd.pipe('run.sh', env=os.environ).collect() run.sh can either exit with status 1 or 0, how could I get the exit code from Python? Thanks! Xuchen
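PySpark's RDD.pipe later grew a checkCode flag (see the reply above). A language-agnostic alternative, sketched in Scala here for consistency with the other examples, is to append the status to stdout inside the pipe and parse it back out; note pipe runs the command once per partition:

    // Wrap the command so each partition's exit status lands in stdout,
    // then filter the marker lines back out.
    val out = rdd.pipe(Seq("sh", "-c", "./run.sh; echo EXIT_CODE:$?"),
                       Map[String, String]())
    val exitCodes = out.filter(_.startsWith("EXIT_CODE:"))
                       .map(_.stripPrefix("EXIT_CODE:").toInt)
                       .collect()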

Re: Word2Vec distributed?

2015-12-20 Thread Yao
I have a similar observation with 1.4.1, where the 3rd stage running mapPartitionsWithIndex at Word2Vec.scala:312 seems to run with a single thread (which takes forever for a reasonably large corpus). Can anyone help explain whether this is an algorithm limitation or whether there are model parameters that can be
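One knob that does exist on the MLlib implementation is numPartitions, which splits training across workers at some cost in accuracy. A sketch, with corpus (an RDD[Seq[String]]) and the parameter values assumed:

    import org.apache.spark.mllib.feature.Word2Vec

    // More partitions parallelize the training stage; accuracy may suffer.
    val model = new Word2Vec()
      .setNumPartitions(16)
      .setVectorSize(100)
      .fit(corpus)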

Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Yao
is problem with starting the shell in yarn-client mode. I am working with HDP 2.2.6 which runs Hadoop 2.6. -Yao derby.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/n25195/derby.log>

RE: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Ge, Yao (Y.)
Thanks. I wonder why this is not widely reported in the user forum. The REPL shell is basically broken in 1.5.0 and 1.5.1. -Yao From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Sunday, October 25, 2015 12:01 PM To: Ge, Yao (Y.) Cc: user Subject: Re: Spark scala REPL - Unable to create sqlContext

[SPARK-9776]Another instance of Derby may have already booted the database #8947

2015-10-23 Thread Ge, Yao (Y.)
. -Yao

Re: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Chunnan Yao
Thank you for this suggestion! But may I ask what's the advantage of using checkpoint instead of cache here, since they both cut lineage? I only know that checkpoint saves the RDD to disk, while cache keeps it in memory. So maybe it's for reliability? Also on
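For context, a minimal sketch of the pattern under discussion (all names are assumptions): checkpoint writes the RDD to reliable storage and then drops the parent lineage, which is what keeps the DAG from growing without bound across batches.

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // assumed path
    var state: RDD[(String, Long)] = initialState  // assumed element type
    dstream.foreachRDD { batch =>
      state = state.union(batch).reduceByKey(_ + _)
      state.checkpoint() // persist to reliable storage, then drop lineage
      state.count()      // an action is required to materialize the checkpoint
    }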

Re: Using TF-IDF from MLlib

2014-12-28 Thread Yao
I found the TF-IDF feature extraction and all the MLlib code that works with pure Vector RDDs very difficult to use, due to the lack of a way to associate a vector back to the original data. Why can't Spark MLlib support LabeledPoint?
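For later readers: the DataFrame-based spark.ml API, introduced after this thread, keeps feature vectors in a column beside the original fields, which sidesteps the association problem. A sketch assuming a docs DataFrame with "id" and "text" columns:

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
    val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(words)
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
    val result = idfModel.transform(tf) // "id" and "text" ride along with "features"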

Re: TF-IDF in Spark 1.1.0

2014-12-28 Thread Yao
Can you show how to do IDF transform on tfWithId? Thanks.

Decision Tree with libsvmtools datasets

2014-12-10 Thread Ge, Yao (Y.)
trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count; println("Training Error = " + trainErr); println("Learned classification tree model:\n" + model); -Yao

Decision Tree with Categorical Features

2014-12-10 Thread Ge, Yao (Y.)
Can anyone provide an example code of using Categorical Features in Decision Tree? Thanks! -Yao
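A minimal sketch against the MLlib decision tree API (trainingData, an RDD[LabeledPoint], and the category counts are assumptions):

    import org.apache.spark.mllib.tree.DecisionTree

    // Feature 0 has 3 categories and feature 2 has 4; any feature not
    // listed in the map is treated as continuous.
    val categoricalFeaturesInfo = Map(0 -> 3, 2 -> 4)
    val model = DecisionTree.trainClassifier(trainingData, numClasses = 2,
      categoricalFeaturesInfo, impurity = "gini", maxDepth = 5, maxBins = 32)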

add support for separate GC log files for different executor

2014-11-05 Thread haitao .yao
Hey, guys. Here's my problem: While using the standalone mode, I always use the following args for the executor: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/tmp/spark.executor.gc.log But as we know, the HotSpot JVM does not support variable substitution on the -Xloggc parameter, which
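A hedged workaround, assuming a HotSpot build that expands %p (process id) and %t (startup timestamp) in the GC log path; that substitution gives each executor JVM its own file without any Spark-side support:

    # spark-defaults.conf (values are illustrative)
    spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -Xloggc:/tmp/spark.executor.gc-%p.log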

spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Hi, Amazon aws started to provide service for China mainland, the region name is cn-north-1. But the script spark provides: spark_ec2.py will query ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and there's no ami information for cn-north-1 region . Can anybody update the

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html cn-north-1 is not a supported region for EC2, as far as I can tell. There may be other AWS services that can use that region, but spark-ec2 relies on EC2. Nick On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao yao.e

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
/jira/secure/Dashboard.jspa to track this request? I can do it if you've never opened a JIRA issue before. Nick On Tue, Nov 4, 2014 at 9:03 PM, haitao .yao yao.e...@gmail.com wrote: I'm afraid not. We have been using EC2 instances in cn-north-1 region for a while. And the latest version of boto

scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with Timestamp field. Anything wrong with my code below? public

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
(RemoteTestRunner.java:197) From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Sunday, October 19, 2014 10:31 AM To: Ge, Yao (Y.); user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp Can you provide the exception stack? Thanks, Daoyuan From: Ge, Yao (Y.) [mailto:y...@ford.com

Exception Logging

2014-10-16 Thread Ge, Yao (Y.)
I need help to better trap exceptions in map functions. What is the best way to catch an exception and provide helpful diagnostic information, such as the source of the input (e.g., the file name, and ideally the line number if I am processing a text file)? -Yao
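One approach is to tag each record with its origin before parsing, since plain textFile drops the file name. A sketch where parse() is a hypothetical user function (note wholeTextFiles reads each file into memory whole, so it suits many small files):

    val tagged = sc.wholeTextFiles("hdfs:///data/input/*").flatMap {
      case (file, content) =>
        content.split("\n").zipWithIndex.map { case (line, i) => (file, i + 1, line) }
    }
    val parsed = tagged.flatMap { case (file, lineNo, line) =>
      try Some(parse(line))
      catch { case e: Exception =>
        // Report where the bad record came from, then drop it.
        System.err.println(s"Bad record at $file:$lineNo - ${e.getMessage}")
        None
      }
    }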

RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
much Sean! -Yao -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@spark.apache.org Subject: Re: Dedup I think the question is about copying the argument. If it's an immutable value like String, yes just

Dedup

2014-10-08 Thread Ge, Yao (Y.)
of the first argument. Is there a better way to do dedup in Spark? -Yao
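A minimal sketch of the pattern under discussion (pairs and records are assumed names):

    // With immutable values (e.g., String), keeping either argument is safe:
    val deduped = pairs.reduceByKey((a, b) => a)
    // When whole records rather than keys should be unique:
    val unique = records.distinct()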

RE: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Ge, Yao (Y.)
array will need to be in ascending order. In many cases, it is probably easier to use the other two forms of the Vectors.sparse function if the indices and value positions are not naturally sorted. -Yao From: Ge, Yao (Y.) Sent: Monday, August 11, 2014 11:44 PM To: 'u...@spark.incubator.apache.org
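To make the constraint concrete (a sketch; the sizes and values are arbitrary):

    import org.apache.spark.mllib.linalg.Vectors

    // The (indices, values) form requires strictly increasing indices:
    val v1 = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
    // The Seq[(index, value)] form sorts unordered pairs for you:
    val v2 = Vectors.sparse(5, Seq((3, 4.0), (1, 2.0)))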

KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Ge, Yao (Y.)
) at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267) What does this means? How do I troubleshoot this problem? Thanks. -Yao

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Chunnan Yao
to the biggest eigenvalue s.toArray(0)*s.toArray(0)? xj @ Tokyo On Fri, Aug 8, 2014 at 12:07 PM, Chunnan Yao yaochun...@gmail.com wrote: Hi there, what you've suggested are all meaningful. But to make myself clearer, my essential problems are: 1. My matrix is asymmetric

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Chunnan Yao
both eigenvalues and eigenvectors, or at least the biggest eigenvalue and the corresponding eigenvector; it seems that current Spark doesn't have such an API. Is it possible for me to write eigenvalue decomposition from scratch? What should I do? Thanks a lot! Miles Yao
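What MLlib does offer is SVD on a distributed matrix. A sketch, with rows (an RDD[Vector]) and k assumed; for a symmetric positive semi-definite matrix the singular values/vectors coincide with the eigen-decomposition, but a general asymmetric eigendecomposition is not built in:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val svd = new RowMatrix(rows).computeSVD(5, computeU = true)
    val topSingularValue = svd.s(0) // largest singular value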

Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Hi, I used 1g of memory for the driver java process and got an OOM error on the driver side before reduceByKey. After analyzing the heap dump, the biggest object is org.apache.spark.MapStatus, which occupied over 900MB of memory. Here's my question: 1. Are there any optimization switches that I can tune

Re: Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
reduceByKey(_ + _, 100) to use only 100 tasks). Matei On May 29, 2014, at 2:03 AM, haitao .yao yao.e...@gmail.com wrote: Hi, I used 1g memory for the driver java process and got OOM error on driver side before reduceByKey. After analyzed the heap dump, the biggest object
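For the archive, the suggested fix in full (pairs is an assumed name):

    // Fewer reduce tasks mean fewer MapStatus entries tracked on the driver;
    // 100 is the illustrative value from the reply above.
    val counts = pairs.reduceByKey(_ + _, 100)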