Re: Set TimeOut and continue with other tasks

2019-07-10 Thread Wei Chen
I am currently trying to use Future/Await to set a timeout inside the
map-reduce.
However, the tasks now fail instead of getting stuck, even though I have a
Try/match to catch the timeout.
Does anyone have an idea why?

The code is like

```Scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// an implicit ExecutionContext is needed to run the Future
implicit val ec: ExecutionContext = ExecutionContext.global

files.map { file =>
  Try {
    def tmpFunc(): Boolean = { /* FILE CONVERSION ON HDFS */ ??? }
    val tmpFuture = Future[Boolean] { tmpFunc() }
    Await.result(tmpFuture, 600.seconds)
  } match {
    case Failure(e) => "F"
    case Success(r) => "S"
  }
}
```

The converter is created in a lazy function in a broadcast object,
which shouldn't be a problem.

Best Regards
Wei


On Wed, Jul 10, 2019 at 3:16 PM Gourav Sengupta 
wrote:

> Is there a way you can identify those patterns in a file or in its name
> and then just tackle them in separate jobs? I use the function
> input_file_name() to find the name of input file of each record and then
> filter out certain files.
>
> Regards,
> Gourav
>
> On Wed, Jul 10, 2019 at 6:47 AM Wei Chen  wrote:
>
>> Hello All,
>>
>> I am using Spark to process some files in parallel.
>> While most files are processed within 3 seconds,
>> we occasionally get stuck on 1 or 2 files that never finish
>> (or would take more than 48 hours).
>> Since the converter is a 3rd-party file conversion tool, we are not able
>> to debug why it gets stuck.
>>
>> Is it possible to set a timeout for the process, throw exceptions for
>> those tasks, and still continue with the other, successful tasks?
>>
>> Best Regards
>> Wei
>>
>
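For reference, a minimal PySpark sketch of the input_file_name() idea Gourav describes above (the path, column name, and blacklist are hypothetical; the Scala functions API has the same input_file_name function):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Tag each record with the file it came from, then filter out files that are
# known to hang the converter (hypothetical path and blacklist).
df = spark.read.text("hdfs:///data/incoming/")
tagged = df.withColumn("src_file", input_file_name())
bad_files = ["hdfs:///data/incoming/huge_file_001.dat"]
ok = tagged.filter(~tagged.src_file.isin(bad_files))
```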


Set TimeOut and continue with other tasks

2019-07-09 Thread Wei Chen
Hello All,

I am using Spark to process some files in parallel.
While most files are processed within 3 seconds,
we occasionally get stuck on 1 or 2 files that never finish
(or would take more than 48 hours).
Since the converter is a 3rd-party file conversion tool, we are not able to
debug why it gets stuck.

Is it possible to set a timeout for the process, throw exceptions for those
tasks, and still continue with the other, successful tasks?

Best Regards
Wei


Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Found it. In case someone else is looking for this:
cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights for each feature
> for the best logistic regression model. I've been looking at the methods
> and attributes of this "cvModel" and "cvModel.bestModel" and still can't
> figure out where these weights are referred. It must be somewhere since we
> can use "cvModel" to transform a new dataframe. Your help is much
> appreciated.
>
>
> Thank you,
> Wei
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


Re: pyspark split pair rdd to multiple

2016-04-20 Thread Wei Chen
Let's assume K is String and V is Integer:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
from pyspark.sql.functions import udf

# pair RDD -> DataFrame with columns K and V
schema = StructType([StructField("K", StringType(), True),
                     StructField("V", IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)
# wrap each V into a single-element array column
udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
df1 = df.select("K", udf1("V").alias("arrayV"))
df1.show()
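If the goal is actually one Array(V) per key (a grouped [K, Array(V)]) rather than a one-element array per row, a plain RDD groupByKey is another option. A sketch under the same assumption (K is String, V is Integer); note that groupByKey shuffles all of a key's values to a single executor:

```python
# rdd is assumed to be a pair RDD of (K, V)
rdd = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

# group all values for each key into a list: [K, Array(V)]
grouped = rdd.groupByKey().mapValues(list)
print(grouped.collect())  # e.g. [('a', [1, 2]), ('b', [3])] (ordering may vary)
```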


On Tue, Apr 19, 2016 at 12:51 PM, pth001 <patcharee.thong...@uni.no> wrote:

> Hi,
>
> How can I split pair rdd [K, V] to map [K, Array(V)] efficiently in
> Pyspark?
>
> Best,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Forgot to mention, I am using Spark 1.5.2 with the Scala API.

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen <wei.chen.ri...@gmail.com> wrote:

> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights for each feature
> for the best logistic regression model. I've been looking at the methods
> and attributes of this "cvModel" and "cvModel.bestModel" and still can't
> figure out where these weights are referred. It must be somewhere since we
> can use "cvModel" to transform a new dataframe. Your help is much
> appreciated.
>
>
> Thank you,
> Wei
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


how to get weights of logistic regression model inside cross validator model?

2016-04-19 Thread Wei Chen
Hi All,

I am using the example of model selection via cross-validation from the
documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
After I get the "cvModel", I would like to see the weights for each feature
for the best logistic regression model. I've been looking at the methods
and attributes of this "cvModel" and "cvModel.bestModel" and still can't
figure out where these weights are exposed. They must be stored somewhere, since
we can use "cvModel" to transform a new dataframe. Your help is much
appreciated.


Thank you,
Wei


optimal way to load parquet files with partition

2016-02-02 Thread Wei Chen
Hi All,

I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way
to get two months of data from a given year (let's say June and July)?

Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)

2. use filter after loading the whole year
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')

Which of the above is better? Or are there better ways to handle this?


Thank you,
Wei
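A note on option 2: year and month are partition columns discovered from the directory layout, so filters on them should be handled by partition pruning and only the matching directories get scanned. A minimal sketch (reading from the table root instead of hard-coding the year in the path):

```python
# Partition columns year/month/day appear as ordinary columns in the schema;
# filtering on them prunes which directories are actually read.
df = sqc.read.parquet('xxx/')
df_june_july = df.filter('year = 2015 and month in (6, 7)')
```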


Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Wei Chen
Thank you. I have tried the window function as follows:

import pyspark.sql.functions as f
sqc = sqlContext
from pyspark.sql import Window
import pandas as pd

DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
   'b': [1,2,3,1,2,3,1,2,3],
   'c': [1,2,3,4,5,6,7,8,9]
  })

df = sqc.createDataFrame(DF)

window = Window.partitionBy("a").orderBy("c")

df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()

I got the following result which is understandable:

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  1|
|  1|  3|  3|  1|
|  2|  1|  4|  4|
|  2|  2|  5|  4|
|  2|  3|  6|  4|
|  3|  1|  7|  7|
|  3|  2|  8|  7|
|  3|  3|  9|  7|
+---+---+---+---+


However, if I change min to max, the result is not what I expected:

df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  2|
|  1|  3|  3|  3|
|  2|  1|  4|  4|
|  2|  2|  5|  5|
|  2|  3|  6|  6|
|  3|  1|  7|  7|
|  3|  2|  8|  8|
|  3|  3|  9|  9|
+---+---+---+---+



Thanks,

Wei
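A note on the max result above: when a window spec has an orderBy and no explicit frame, the default frame runs from the start of the partition up to the current row, so max over it is a running maximum (min only looks like the group minimum because the rows are ordered by c ascending). A sketch of two ways to get the true per-group maximum with the same df (the frame constants assume Spark 2.1+; older versions can use rowsBetween(-sys.maxsize, sys.maxsize)):

```python
import pyspark.sql.functions as f
from pyspark.sql import Window

# 1. Drop the orderBy: the frame is then the whole partition.
w_all = Window.partitionBy('a')
df.select('a', 'b', 'c', f.max('c').over(w_all).alias('y')).show()

# 2. Keep the orderBy but widen the frame explicitly.
w_full = (Window.partitionBy('a').orderBy('c')
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.select('a', 'b', 'c', f.max('c').over(w_full).alias('y')).show()
```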


On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <guha.a...@gmail.com> wrote:

> Yes there is. It is called window function over partitions.
>
> Equivalent SQL would be:
>
> select * from
>  (select a,b,c, rank() over (partition by a order by b) r from df)
> x
> where r = 1
>
> You can register your DF as a temp table and use the sql form. Or, (>Spark
> 1.4) you can use window methods and their variants in Spark SQL module.
>
> HTH
>
> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <wei.chen.ri...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to retrieve the rows with a minimum value of a column for
>> each group. For example: the following dataframe:
>>
>> a | b | c
>> --
>> 1 | 1 | 1
>> 1 | 2 | 2
>> 1 | 3 | 3
>> 2 | 1 | 4
>> 2 | 2 | 5
>> 2 | 3 | 6
>> 3 | 1 | 7
>> 3 | 2 | 8
>> 3 | 3 | 9
>> --
>>
>> I group by 'a', and want the rows with the smallest 'b', that is, I want
>> to return the following dataframe:
>>
>> a | b | c
>> --
>> 1 | 1 | 1
>> 2 | 1 | 4
>> 3 | 1 | 7
>> --
>>
>> The dataframe I have is huge, so getting the minimum value of b for each
>> group and joining it back on the original dataframe is very expensive. Is there a
>> better way to do this?
>>
>>
>> Thanks,
>> Wei
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
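A DataFrame-API version of the rank() approach described above might look like this (a sketch using the df from the snippet at the top of the thread; rank() mirrors the SQL, while row_number() would break ties and keep exactly one row per group):

```python
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('a').orderBy('b')

smallest_b = (df.select('a', 'b', 'c', f.rank().over(w).alias('r'))
                .filter('r = 1')
                .drop('r'))
smallest_b.show()
```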



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


pyspark dataframe: row with a minimum value of a column for each group

2016-01-05 Thread Wei Chen
Hi,

I am trying to retrieve the rows with a minimum value of a column for each
group. For example: the following dataframe:

a | b | c
--
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9
--

I group by 'a', and want the rows with the smallest 'b', that is, I want to
return the following dataframe:

a | b | c
--
1 | 1 | 1
2 | 1 | 4
3 | 1 | 7
--

The dataframe I have is huge, so getting the minimum value of b for each group
and joining it back on the original dataframe is very expensive. Is there a better
way to do this?


Thanks,
Wei


UDAF support in PySpark?

2015-12-15 Thread Wei Chen
Hi,

I am wondering if there is UDAF support in PySpark with Spark 1.5. If not,
is Spark 1.6 going to incorporate that?

Thanks,
Wei
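As far as I know, there is no way to define a UDAF directly in Python in the 1.x line. One common workaround (a sketch, not an official API; df, the column names, and the aggregate function are hypothetical) is to collect each group's values into an array with collect_list and apply an ordinary Python UDF to it. collect_list is in pyspark.sql.functions from 1.6 (it may require a HiveContext before 2.0), and this only scales to groups whose values fit on one row:

```python
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import DoubleType

# hypothetical aggregate: mean of the group's values
def my_agg(values):
    return float(sum(values)) / len(values) if values else 0.0

my_agg_udf = udf(my_agg, DoubleType())

result = (df.groupBy('k')
            .agg(collect_list('v').alias('vs'))
            .select('k', my_agg_udf('vs').alias('agg_v')))
```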


pyspark sql: number of partitions and partition by size?

2015-11-13 Thread Wei Chen
Hey Friends,

I am trying to use df.write.parquet() to write a dataframe to parquet
files. I have the following questions.

1. number of partitions
The default number of partition seems to be 200. Is there any way other
than using df.repartition(n) to change this number? I was told repartition
can be very expensive.

2. partition by size
When I use df.write.partitionBy('year'), if the number of entries with
"year=2006" is very small, the files under the "year=2006" partition
can be very small. Being able to target a size for each partition file
would be very helpful.


Thank you,
Wei
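On question 1: the 200 usually comes from spark.sql.shuffle.partitions (a DataFrame inherits it after a shuffle), so lowering that setting, or calling coalesce(), which avoids a full shuffle, are the usual alternatives to repartition(). I am not aware of a direct "target file size" option in this version, so the sketch below only controls the file count (the numbers and output path are placeholders):

```python
# fewer shuffle partitions for DataFrames produced by joins/aggregations
sqlContext.setConf('spark.sql.shuffle.partitions', '50')

# or shrink the partition count just before writing
(df.coalesce(20)
   .write
   .partitionBy('year')
   .parquet('hdfs:///tmp/output'))
```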


Is there anyway to do partition discovery without 'field=' in folder names?

2015-11-06 Thread Wei Chen
Hey Friends,

I've been using partition discovery with folder structures that have
"field=" in the folder names. However, I've also encountered a lot of folder
structures without "field=" in the folder names, especially for year,
month, and day. Is there any way we can assign a field to each level of
such a folder structure (xx/2014/03/04/) and do partition discovery?


Thank you,
Wei
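Built-in partition discovery only recognizes the key=value layout, so for a plain xx/yyyy/mm/dd/ structure a common workaround (a sketch; the base path and column names are hypothetical) is to read specific folders and attach the partition values as literal columns, then union the pieces:

```python
from pyspark.sql import functions as F

def load_day(base, year, month, day):
    path = '{0}/{1:04d}/{2:02d}/{3:02d}/'.format(base, year, month, day)
    return (sqlContext.read.parquet(path)
            .withColumn('year', F.lit(year))
            .withColumn('month', F.lit(month))
            .withColumn('day', F.lit(day)))

df = load_day('xx', 2014, 3, 4).unionAll(load_day('xx', 2014, 3, 5))
```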


different Row objects?

2015-09-03 Thread Wei Chen
Hey Friends,

Recently I have been using Spark 1.3.1, mainly pyspark.sql. I noticed that
a Row object collected directly from a DataFrame is different from a
Row object defined directly via Row(*args, **kwargs).

>>>from pyspark.sql.types import Row
>>>aaa = Row(a=1, b=2, c=Row(a=1, b=2))
>>>tuple(sc.parallelize([aaa]).toDF().collect()[0])

(1, 2, (1, 2))

>>>tuple(aaa)

(1, 2, Row(a=1, b=2))


This matters to me because I want to be able to create a DataFrame
with one of the columns being a Row object via
sqlContext.createDataFrame(data, schema), where I pass in
the schema explicitly. However, if the data is an RDD of Row objects like
"aaa" in my example, it fails in the __verify_type function.



Thank you,

Wei
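A possible workaround (a sketch, not verified against 1.3.1 specifically): pass the nested struct as a plain tuple that matches the nested StructType instead of a Row object when supplying an explicit schema:

```python
from pyspark.sql.types import StructType, StructField, IntegerType

inner = StructType([StructField('a', IntegerType(), True),
                    StructField('b', IntegerType(), True)])
schema = StructType([StructField('a', IntegerType(), True),
                     StructField('b', IntegerType(), True),
                     StructField('c', inner, True)])

# the nested struct is given as a tuple rather than Row(a=1, b=2)
data = [(1, 2, (1, 2))]
df = sqlContext.createDataFrame(data, schema=schema)
df.show()
```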