Re: Set TimeOut and continue with other tasks

2019-07-10 Thread Wei Chen
...rd and then filter out certain files.

Regards,
Gourav

On Wed, Jul 10, 2019 at 6:47 AM Wei Chen wrote:
> Hello All,
>
> I am using Spark to process some files in parallel.
> While most files are able to be processed within 3

Set TimeOut and continue with other tasks

2019-07-09 Thread Wei Chen
Hello All, I am using Spark to process some files in parallel. While most files are able to be processed within 3 seconds, it is possible that we get stuck on 1 or 2 files, as they will never finish (or will take more than 48 hours). Since it is a 3rd party file conversion tool, we are not able to
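A minimal sketch of one way to bound each conversion, assuming the work runs as a plain map over file paths; convert_file is a hypothetical stand-in for the 3rd-party tool, and the 30-second budget is illustrative. The call runs in a one-process pool so a hung conversion can be killed:

from multiprocessing import Pool, TimeoutError

def convert_file(path):
    # hypothetical stand-in for the 3rd-party conversion tool
    pass

def convert_with_timeout(path, timeout_s=30):
    pool = Pool(processes=1)          # one worker per file so it can be killed
    try:
        res = pool.apply_async(convert_file, (path,))
        out = res.get(timeout=timeout_s)
        pool.close()
        return (path, out)
    except TimeoutError:
        pool.terminate()              # kill the stuck conversion
        return (path, None)           # mark the file skipped, move on

# results = sc.parallelize(paths).map(convert_with_timeout).collect()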

Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Found it. In case someone else is looking for this:

cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:
> Hi All,
>
> I am using the example of model sele

Re: pyspark split pair rdd to multiple

2016-04-20 Thread Wei Chen
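One common approach to the question in the subject, as a hedged sketch (not necessarily the answer given in the thread): split a pair RDD into one RDD per key with a lazy filter per distinct key, caching the parent so it is not re-read for each child. Assumes a live SparkContext sc:

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)]).cache()
keys = pairs.keys().distinct().collect()
# k=k freezes the loop variable inside each lambda
split = {k: pairs.filter(lambda kv, k=k: kv[0] == k) for k in keys}
# split['a'].collect() -> [('a', 1), ('a', 3)]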

Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Forgot to mention, I am using the Scala version of Spark 1.5.2.

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:
> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-gui

how to get weights of logistic regression model inside cross validator model?

2016-04-19 Thread Wei Chen
Hi All, I am using the example of model selection via cross-validation from the documentation here: http://spark.apache.org/docs/latest/ml-guide.html. After I get the "cvModel", I would like to see the weights for each feature for the best logistic regression model. I've been looking at the
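For reference, a PySpark sketch of the same extraction the Scala answer above performs, assuming a fitted CrossValidatorModel named cvModel as in the question (the attribute is weights in Spark 1.5; later releases renamed it coefficients):

best_lr = cvModel.bestModel    # the LogisticRegressionModel picked by CV
print(best_lr.weights)         # one entry per feature
print(best_lr.intercept)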

optimal way to load parquet files with partition

2016-02-02 Thread Wei Chen
Hi All, I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way to get two months of data from a given year (let's say June and July)? Two ways I can think of:

1. use unionAll:
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)
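A hedged sketch of the usual alternative to unionAll: read the table root and filter on the discovered partition columns, so Spark prunes the months that are not needed (path as in the question):

df = sqc.read.parquet('xxx')
two_months = df.filter((df.year == 2015) & (df.month.isin(6, 7)))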

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Wei Chen
...can register your DF as a temp table and use the sql form. Or (Spark > 1.4) you can use window methods and their variants in the Spark SQL module.

HTH

On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com> wrote:
> Hi,
>
> I am tr

pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread Wei Chen
Hi, I am trying to retrieve the rows with a minimum value of a column for each group. For example, given the following dataframe:

a | b | c
---------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9

I group by 'a', and want the rows with the
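A minimal sketch of the window approach suggested in the reply above (the function was named rowNumber before Spark 1.6, row_number from 1.6 on):

from pyspark.sql import Window
from pyspark.sql.functions import row_number

w = Window.partitionBy('a').orderBy('c')
result = (df.withColumn('rn', row_number().over(w))
            .filter('rn = 1')
            .drop('rn'))
# keeps exactly one row per 'a': the one with the smallest 'c'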

UDAF support in PySpark?

2015-12-15 Thread Wei Chen
Hi, I am wondering if there is UDAF support in PySpark with Spark 1.5. If not, is Spark 1.6 going to incorporate that? Thanks, Wei
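PySpark's 1.x line had no Python UDAF API; a common workaround, sketched here with illustrative column names, was to drop to the RDD API and build the aggregate with aggregateByKey (a per-key mean, for example):

rdd = df.rdd.map(lambda row: (row['a'], row['c']))
sums = rdd.aggregateByKey(
    (0.0, 0),                                   # (running sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold one value in
    lambda x, y: (x[0] + y[0], x[1] + y[1]))    # merge two partials
means = sums.mapValues(lambda s: s[0] / s[1])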

pyspark sql: number of partitions and partition by size?

2015-11-13 Thread Wei Chen
Hey Friends, I am trying to use df.write.parquet() to write a dataframe to parquet files. I have the following questions.

1. number of partitions: the default number of partitions seems to be 200. Is there any way other than using df.repartition(n) to change this number? I was told
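A hedged sketch of the two usual levers (the values here are illustrative): the 200 comes from the shuffle-partition setting, which can be changed without a repartition, or the dataframe can be coalesced just before the write:

sqlContext.setConf('spark.sql.shuffle.partitions', '50')  # affects shuffles
df.coalesce(8).write.parquet('/tmp/out')                  # 8 output files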

Is there any way to do partition discovery without 'field=' in folder names?

2015-11-06 Thread Wei Chen
Hey Friends, I've been using partition discovery with folder structures that have "field=" in folder names. However, I've also encountered a lot of folder structures without "field=" in folder names, especially when it is year, month, day. Is there any way that we can assign a field to each level
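A sketch of the manual workaround (all names illustrative, not from the thread): read each leaf folder yourself and attach the partition values as literal columns, since discovery only triggers on key=value folder names:

from pyspark.sql import functions as F

def load_day(base, year, month, day):
    path = '{0}/{1:04d}/{2:02d}/{3:02d}'.format(base, year, month, day)
    return (sqlContext.read.parquet(path)
            .withColumn('year', F.lit(year))
            .withColumn('month', F.lit(month))
            .withColumn('day', F.lit(day)))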

different Row objects?

2015-09-03 Thread Wei Chen
Hey Friends, Recently I have been using Spark 1.3.1, mainly pyspark.sql. I noticed that the Row object collected directly from a DataFrame is different from the Row object we directly defined from Row(*arg, **kwarg).

>>> from pyspark.sql.types import Row
>>> aaa = Row(a=1, b=2, c=Row(a=1, b=2))
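A short sketch of the difference being asked about: a Row built from keyword arguments sorts its fields alphabetically, while Rows collected from a DataFrame keep the schema's column order, which is why the two can compare unequal field-for-field:

from pyspark.sql.types import Row

aaa = Row(a=1, b=2, c=Row(a=1, b=2))
print(aaa)               # Row(a=1, b=2, c=Row(a=1, b=2)) - kwargs are sorted
print(aaa.a, aaa['b'])   # 1 2: access by attribute or by key
# rows from df.collect() instead follow df.schema's field order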