Re: Set TimeOut and continue with other tasks
I am currently trying to use Future/Await to set a timeout inside the map. However, the tasks now fail instead of hanging, even though I wrap the work in a Try and match on the result. Does anyone have an idea why? The code looks like this:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

files.map { file =>
  Try {
    def tmpFunc(): Boolean = {
      ??? // file conversion on HDFS via the 3rd-party converter
    }
    val tmpFuture = Future[Boolean] { tmpFunc() }
    Await.result(tmpFuture, 600.seconds)
  } match {
    case Failure(e) => "F"
    case Success(r) => "S"
  }
}
```

The converter is created in a lazy function inside a broadcast object, which shouldn't be a problem.

Best Regards,
Wei

On Wed, Jul 10, 2019 at 3:16 PM Gourav Sengupta wrote:

> Is there a way you can identify those patterns in a file or in its name
> and then just tackle them in separate jobs? I use the function
> input_file_name() to find the name of the input file for each record and
> then filter out certain files.
>
> Regards,
> Gourav
>
> On Wed, Jul 10, 2019 at 6:47 AM Wei Chen wrote:
>
>> Hello All,
>>
>> I am using Spark to process some files in parallel. While most files
>> are processed within 3 seconds, we occasionally get stuck on 1 or 2
>> files that never finish (or would take more than 48 hours). Since the
>> converter is a 3rd-party tool, we are not able to debug why it gets
>> stuck.
>>
>> Is it possible to set a timeout for the process and throw exceptions
>> for those tasks, while still continuing with the other successful
>> tasks?
>>
>> Best Regards,
>> Wei
Set TimeOut and continue with other tasks
Hello All,

I am using Spark to process some files in parallel. While most files are processed within 3 seconds, we occasionally get stuck on 1 or 2 files that never finish (or would take more than 48 hours). Since the converter is a 3rd-party file conversion tool, we are not able to debug why it gets stuck.

Is it possible to set a timeout for the process and throw exceptions for those tasks, while still continuing with the other successful tasks?

Best Regards,
Wei
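For reference, a rough PySpark analogue of the Future/Await timeout pattern tried in the reply above. This is only a sketch: convert_file is a hypothetical stand-in for the 3rd-party converter call, and the timeout only stops the task from waiting; it cannot kill a hung native call, whose thread may linger on the executor.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def convert_file(path):
    # Hypothetical stand-in for the 3rd-party HDFS file conversion.
    return True

def convert_with_timeout(path, timeout=600):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(convert_file, path)
    try:
        future.result(timeout=timeout)
        return "S"
    except TimeoutError:
        return "F"  # give up waiting; the worker thread may keep running
    except Exception:
        return "F"
    finally:
        pool.shutdown(wait=False)

# files is an RDD of input paths, as in the question above.
statuses = files.map(convert_with_timeout)
```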
Re: how to get weights of logistic regression model inside cross validator model?
Found it. In case someone else is looking for this:

```scala
cvModel.bestModel
  .asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel]
  .weights
```

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights of each
> feature for the best logistic regression model. I've been looking at the
> methods and attributes of "cvModel" and "cvModel.bestModel" and still
> can't figure out where these weights are stored. They must be somewhere,
> since we can use "cvModel" to transform a new dataframe. Your help is
> much appreciated.
>
> Thank you,
> Wei

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984
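For anyone on the Python API, a rough equivalent (an assumption based on the Spark 1.5.x naming, where the attribute is still called weights; 1.6 renames it to coefficients). No cast is needed, since Python is dynamically typed:

```python
# bestModel is already the fitted LogisticRegressionModel in PySpark,
# so the weights can be read off directly.
weights = cvModel.bestModel.weights
print(weights)
```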
Re: pyspark split pair rdd to multiple
Let's assume K is String and V is Integer:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

schema = StructType([StructField("K", StringType(), True),
                     StructField("V", IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)

udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
df1 = df.select("K", udf1("V").alias("arrayV"))
df1.show()
```

On Tue, Apr 19, 2016 at 12:51 PM, pth001 <patcharee.thong...@uni.no> wrote:

> Hi,
>
> How can I split a pair RDD [K, V] to a map [K, Array(V)] efficiently in
> PySpark?
>
> Best,
> Patcharee

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984
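If the goal is to collect all values per key, rather than wrap each single value in a one-element array, grouping at the RDD level is one possibility. A minimal sketch:

```python
# Gather all values for each key into a list: [K, V] -> [K, list(V)].
rdd = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
grouped = rdd.groupByKey().mapValues(list)
print(grouped.collect())  # [('a', [1, 2]), ('b', [3])]
```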
Re: how to get weights of logistic regression model inside cross validator model?
Forgot to mention, I am using Spark 1.5.2 with the Scala API.

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights of each
> feature for the best logistic regression model. I've been looking at the
> methods and attributes of "cvModel" and "cvModel.bestModel" and still
> can't figure out where these weights are stored. They must be somewhere,
> since we can use "cvModel" to transform a new dataframe. Your help is
> much appreciated.
>
> Thank you,
> Wei

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984
how to get weights of logistic regression model inside cross validator model?
Hi All,

I am using the example of model selection via cross-validation from the documentation here: http://spark.apache.org/docs/latest/ml-guide.html. After I get the "cvModel", I would like to see the weights of each feature for the best logistic regression model. I've been looking at the methods and attributes of "cvModel" and "cvModel.bestModel" and still can't figure out where these weights are stored. They must be somewhere, since we can use "cvModel" to transform a new dataframe. Your help is much appreciated.

Thank you,
Wei
optimal way to load parquet files with partition
Hi All,

I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way to get two months of data from a given year, say June and July of 2015?

Two ways I can think of:

1. Use unionAll:

```python
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)
```

2. Load the whole year and filter:

```python
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')
```

Which of the above is better? Or are there better ways to handle this?

Thank you,
Wei
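A third option is worth sketching, since read.parquet accepts several paths at once, which avoids the unionAll without scanning the whole year. (Option 2 should also be cheap in practice: a filter on a partition column is normally pruned at planning time, so only the matching directories are read.)

```python
# Pass both month directories directly to a single read.
df = sqc.read.parquet('xxx/year=2015/month=6', 'xxx/year=2015/month=7')
```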
Re: pyspark dataframe: row with a minimum value of a column for each group
Thank you. I have tried the window function as follows:

```python
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import Window

sqc = sqlContext
DF = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = sqc.createDataFrame(DF)

window = Window.partitionBy("a").orderBy("c")
df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()
```

I got the following result, which is understandable:

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  1|
|  1|  3|  3|  1|
|  2|  1|  4|  4|
|  2|  2|  5|  4|
|  2|  3|  6|  4|
|  3|  1|  7|  7|
|  3|  2|  8|  7|
|  3|  3|  9|  7|
+---+---+---+---+

However, if I change min to max, the result is not what I expected:

```python
df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show()
```

gives

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  2|
|  1|  3|  3|  3|
|  2|  1|  4|  4|
|  2|  2|  5|  5|
|  2|  3|  6|  6|
|  3|  1|  7|  7|
|  3|  2|  8|  8|
|  3|  3|  9|  9|
+---+---+---+---+

Thanks,
Wei

On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <guha.a...@gmail.com> wrote:

> Yes there is. It is called a window function over partitions.
>
> The equivalent SQL would be:
>
> select * from
>   (select a, b, c, rank() over (partition by a order by b) r from df) x
> where r = 1
>
> You can register your DF as a temp table and use the SQL form. Or (Spark
> 1.4+) you can use the window methods and their variants in the Spark SQL
> module.
>
> HTH
>
> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to retrieve the rows with the minimum value of a column for
>> each group. For example, given the following dataframe:
>>
>> a | b | c
>> ---------
>> 1 | 1 | 1
>> 1 | 2 | 2
>> 1 | 3 | 3
>> 2 | 1 | 4
>> 2 | 2 | 5
>> 2 | 3 | 6
>> 3 | 1 | 7
>> 3 | 2 | 8
>> 3 | 3 | 9
>>
>> I group by 'a' and want the rows with the smallest 'b', that is, I want
>> to return the following dataframe:
>>
>> a | b | c
>> ---------
>> 1 | 1 | 1
>> 2 | 1 | 4
>> 3 | 1 | 7
>>
>> The dataframe I have is huge, so getting the minimum value of b for
>> each group and joining it back on the original dataframe is very
>> expensive. Is there a better way to do this?
>>
>> Thanks,
>> Wei
>
> --
> Best Regards,
> Ayan Guha

--
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984
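The max result is actually consistent with the window-frame defaults: when an orderBy is present, the frame runs from the start of the partition to the current row, so max('c') becomes a running max. A sketch of widening the frame to the whole partition (assuming the 1.5/1.6-era Python API, where -sys.maxsize and sys.maxsize denote the unbounded frame ends):

```python
import sys
import pyspark.sql.functions as f
from pyspark.sql import Window

# Frame spanning the entire partition, so max('c') is the group max
# rather than a running max.
window = (Window.partitionBy("a")
                .orderBy("c")
                .rowsBetween(-sys.maxsize, sys.maxsize))
df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show()
```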
pyspark dataframe: row with a minimum value of a column for each group
Hi,

I am trying to retrieve the rows with the minimum value of a column for each group. For example, given the following dataframe:

a | b | c
---------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9

I group by 'a' and want the rows with the smallest 'b', that is, I want to return the following dataframe:

a | b | c
---------
1 | 1 | 1
2 | 1 | 4
3 | 1 | 7

The dataframe I have is huge, so getting the minimum value of b for each group and joining it back on the original dataframe is very expensive. Is there a better way to do this?

Thanks,
Wei
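A DataFrame-API version of the rank() suggestion from the reply above might look like the following sketch (assuming Spark 1.6+, where row_number is exposed in pyspark.sql.functions; earlier releases call it rowNumber):

```python
import pyspark.sql.functions as f
from pyspark.sql import Window

# Keep only the first row per group of 'a' when ordered by 'b' ascending.
w = Window.partitionBy("a").orderBy("b")
result = (df.withColumn("r", f.row_number().over(w))
            .filter("r = 1")
            .drop("r"))
result.show()
```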
UDAF support in PySpark?
Hi,

I am wondering if there is UDAF support in PySpark with Spark 1.5. If not, is Spark 1.6 going to incorporate that?

Thanks,
Wei
pyspark sql: number of partitions and partition by size?
Hey Friends,

I am trying to use df.write.parquet() to write a dataframe to parquet files. I have the following questions.

1. Number of partitions: the default number of partitions seems to be 200. Is there any way other than df.repartition(n) to change this number? I was told repartition can be very expensive.

2. Partitioning by size: when I use df.write.partitionBy('year'), if the number of entries with "year=2006" is very small, the files under the partition "year=2006" can be very small. If we could assign a target size to each partition file, that would be very helpful.

Thank you,
Wei
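On question 1, two things may help, sketched below under the assumption of the 1.5-era API: the default of 200 comes from the spark.sql.shuffle.partitions setting, and coalesce() can shrink the partition count without the full shuffle that repartition() performs (it can only reduce the count, not increase it). The output path below is hypothetical.

```python
# Lower the shuffle-partition default that produces 200 output files.
sqlContext.setConf("spark.sql.shuffle.partitions", "50")

# Or shrink the partition count just before writing; coalesce avoids a
# full shuffle when only reducing the number of partitions.
df.coalesce(10).write.parquet("/tmp/output")
```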
Is there anyway to do partition discovery without 'field=' in folder names?
Hey Friends,

I've been using partition discovery with folder structures that have "field=" in the folder names. However, I've also encountered a lot of folder structures without "field=" in the folder names, especially when the levels are year, month, and day. Is there any way we can assign a field to each level of such a folder structure (xx/2014/03/04/) and do partition discovery?

Thank you,
Wei
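One workaround, as a sketch only: partition discovery itself needs the field= form, so the partition columns can instead be tagged by hand when reading a leaf directory (looping over the dates and unioning the results would generalize this):

```python
from pyspark.sql import functions as f

# Read one leaf directory and attach the partition values manually.
df = (sqlContext.read.parquet('xx/2014/03/04/')
      .withColumn('year', f.lit(2014))
      .withColumn('month', f.lit(3))
      .withColumn('day', f.lit(4)))
```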
different Row objects?
Hey Friends,

Recently I have been using Spark 1.3.1, mainly pyspark.sql. I noticed that a Row object collected directly from a DataFrame is different from a Row object defined directly via Row(*args, **kwargs):

```python
>>> from pyspark.sql.types import Row
>>> aaa = Row(a=1, b=2, c=Row(a=1, b=2))
>>> tuple(sc.parallelize([aaa]).toDF().collect()[0])
(1, 2, (1, 2))
>>> tuple(aaa)
(1, 2, Row(a=1, b=2))
```

This matters to me because I want to create a DataFrame with one of the columns being a Row object, using sqlContext.createDataFrame(data, schema) where I specifically pass in the schema. However, if the data is an RDD of Row objects like "aaa" in my example, it fails in the _verify_type function.

Thank you,
Wei
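A sketch of one possible way around this (an assumption on my part, not from the thread): declare the nested column as a StructType in the schema and pass the nested values as plain tuples, so the type verifier sees tuples rather than Row instances:

```python
from pyspark.sql.types import IntegerType, StructField, StructType

# Schema with a nested struct column 'c' holding fields 'a' and 'b'.
inner = StructType([StructField("a", IntegerType(), True),
                    StructField("b", IntegerType(), True)])
schema = StructType([StructField("a", IntegerType(), True),
                     StructField("b", IntegerType(), True),
                     StructField("c", inner, True)])

data = [(1, 2, (1, 2))]  # nested struct given as a plain tuple
df = sqlContext.createDataFrame(data, schema)
df.show()
```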