> each record and then
> filter out certain files.
>
> Regards,
> Gourav
>
> On Wed, Jul 10, 2019 at 6:47 AM Wei Chen wrote:
>
>> Hello All,
>>
>> I am using Spark to process some files in parallel.
>> While most files are able to be processed within 3 seconds,
Hello All,
I am using Spark to process some files in parallel.
While most files are able to be processed within 3 seconds,
it is possible that we get stuck on 1 or 2 files that never finish (or
take more than 48 hours).
Since it is a 3rd party file conversion tool, we are not able to debug
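In case it helps, a minimal sketch of one way to impose a per-file timeout (untested; convert_file and paths are hypothetical stand-ins for the 3rd-party call and the file list):

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

TIMEOUT = 60  # seconds; generous, since most files finish in ~3s

def convert_with_timeout(path):
    pool = ThreadPoolExecutor(max_workers=1)
    # convert_file stands in for the 3rd-party conversion call
    future = pool.submit(convert_file, path)
    try:
        return (path, future.result(timeout=TIMEOUT))
    except FutureTimeout:
        return (path, None)  # give up on this file
    finally:
        pool.shutdown(wait=False)  # don't block on the hung call

converted = sc.parallelize(paths).map(convert_with_timeout)
ok = converted.filter(lambda kv: kv[1] is not None)

Note the hung thread itself is not killed; shutdown(wait=False) only stops us from blocking on it, so a truly stuck native call can still pin a core. Running the conversion in a child process that can be terminated is the heavier but safer variant.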
Found it. In case someone else is looking for this:
cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights
On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen wrote:
> Hi All,
>
> I am using the example of model selection via cross-validation
--
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984
Forgot to mention, I am using Spark 1.5.2 with the Scala API.
On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen wrote:
> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights for each feature
Hi All,
I am using the example of model selection via cross-validation from the
documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
After I get the "cvModel", I would like to see the weights for each feature
for the best logistic regression model. I've been looking at the method
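For the PySpark side, a minimal sketch of the same extraction (assuming the tuned estimator is a Pipeline ending in LogisticRegression, as in the ml-guide example; no cast is needed in Python):

# cvModel is the fitted CrossValidatorModel from the ml-guide example
best = cvModel.bestModel        # a PipelineModel when a Pipeline was tuned
lrModel = best.stages[-1]       # assumes LogisticRegression is the last stage
print(lrModel.weights)          # renamed to .coefficients in Spark 1.6+
print(lrModel.intercept)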
Hi All,
I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
to get two months of data from a given year (let's say June and July)?
Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)
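Besides unionAll, a sketch of two alternatives (assuming the 'xxx' root uses key=value partition folders throughout):

# Option 1: pass both partition directories to a single read; note that
# the year/month columns are not recovered when reading below them
df = sqc.read.parquet('xxx/year=2015/month=6', 'xxx/year=2015/month=7')

# Option 2: read from the root and filter on the partition columns;
# partition pruning should keep Spark from scanning the other months
df = sqc.read.parquet('xxx').filter("year = 2015 AND month IN (6, 7)")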
> select a, b, c from (
>   select a, b, c, rank() over (partition by a order by b) r from df) x
> where r = 1
>
> You can register your DF as a temp table and use the sql form. Or, (>Spark
> 1.4) you can use window methods and their variants in Spark SQL module.
>
> HTH
>
> On Wed, Jan 6, 2016 at 11:
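For reference, a sketch of the DataFrame-API form of that query (Spark >= 1.4; rank() keeps ties, row_number() would break them):

from pyspark.sql import Window
from pyspark.sql.functions import rank

w = Window.partitionBy('a').orderBy('b')
result = df.withColumn('r', rank().over(w)).filter('r = 1').drop('r')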
Hi,
I am trying to retrieve the rows with the minimum value of a column for each
group. For example, given the following dataframe:
a | b | c
--
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9
--
I group by 'a', and want the rows with the smallest value of 'b'.
Hi,
I am wondering if there is UDAF support in PySpark with Spark 1.5. If not,
is Spark 1.6 going to incorporate that?
Thanks,
Wei
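As far as I know there is no Python UDAF in 1.5, and 1.6 did not add one either (Scala/Java UDAFs can still be registered and called from SQL). One workaround is to aggregate on the RDD side; a sketch with a hypothetical geometric mean over column 'v' per key 'k':

import math

pairs = df.select('k', 'v').rdd.map(lambda row: (row.k, row.v))
agg = pairs.aggregateByKey(
    (0.0, 0),                                        # (sum of logs, count)
    lambda acc, v: (acc[0] + math.log(v), acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
geo_mean = agg.mapValues(lambda acc: math.exp(acc[0] / acc[1]))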
Hey Friends,
I am trying to use df.write.parquet() to write a DataFrame to parquet
files. I have the following questions.
1. number of partitions
The default number of partitions seems to be 200. Is there any way other
than using df.repartition(n) to change this number? I was told repartition
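If the 200 is coming from a shuffle (join or aggregation) upstream of the write, it is the spark.sql.shuffle.partitions setting; a sketch of the two common knobs:

# lower the post-shuffle partition count globally
sqlContext.setConf("spark.sql.shuffle.partitions", "50")

# or shrink just this output; coalesce avoids a full shuffle
df.coalesce(10).write.parquet('out/path')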
Hey Friends,
I've been using partition discovery with folder structures that have
"field=" in folder names. However, I've also encountered a lot of folders
structures without "field=" in folder names, especially when it is year,
month, day. Is there anyway that we can assign a field to each level
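I don't believe discovery works without the key=value names; a sketch of attaching the fields by hand instead (the yyyy/mm/dd layout below is a hypothetical example):

from pyspark.sql.functions import lit

def read_day(y, m, d):
    path = 'root/%04d/%02d/%02d' % (y, m, d)
    return (sqc.read.parquet(path)
               .withColumn('year', lit(y))
               .withColumn('month', lit(m))
               .withColumn('day', lit(d)))

df = read_day(2015, 6, 1).unionAll(read_day(2015, 6, 2))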
Hey Friends,
Recently I have been using Spark 1.3.1, mainly pyspark.sql. I noticed that
the Row object collected directly from a DataFrame is different from the
Row object defined directly via Row(*args, **kwargs).
>>>from pyspark.sql.types import Row
>>>aaa = Row(a=1, b=2, c=Row(a=1, b=2))
>>>
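For anyone comparing the two, a short illustration of the difference (as of 1.3-1.5: keyword-constructed Rows sort their fields alphabetically, while collected Rows follow the DataFrame schema):

>>> from pyspark.sql.types import Row
>>> Row(b=2, a=1)                  # kwargs get sorted alphabetically
Row(a=1, b=2)
>>> df = sqlContext.createDataFrame([(1, 2)], ['b', 'a'])
>>> df.first()                     # schema order is preserved
Row(b=1, a=2)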