Re: how to see Pipeline model information

2016-11-27 Thread Zhiliang Zhu
I have worked it out by just letting Java call the Scala class function. Thanks a lot, Xiaomeng! On Friday, November 25, 2016 1:50 AM, Xiaomeng Wan wrote: here is the Scala code I use to get the best model, I never used Java     val cv = new
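The Scala snippet referenced above is cut off; the sketch below shows the usual pattern for pulling the best model out of a fitted CrossValidator. It is an illustration only - `pipeline`, `evaluator`, `paramGrid`, and `training` stand in for objects the thread does not show. A thin Scala object wrapping this is also what a Java caller would invoke, per the "let Java call the Scala class function" approach above.

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}

    // Assumed to exist elsewhere: pipeline (an ml.Pipeline), evaluator,
    // paramGrid, and training (a DataFrame). Names are illustrative only.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel: CrossValidatorModel = cv.fit(training)

    // bestModel is the winning fitted Pipeline; inspect its stages individually.
    val best = cvModel.bestModel.asInstanceOf[PipelineModel]
    best.stages.foreach(stage => println(stage.explainParams()))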

Re: Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-27 Thread Zhuo Tao
Hi Takeshi, Thank you for your comment. I changed it to RDD and it's a lot better. Zhuo On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro wrote: > Hi, > > I think this is just the overhead of representing nested elements as internal > rows at runtime > (e.g., it consumes
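For illustration only, here is a minimal sketch of the kind of switch described, with toy case classes standing in for the thread's real nested types (which are not shown):

    import org.apache.spark.sql.SparkSession

    case class Event(id: Long, payload: Seq[String])
    case class Meta(id: Long, label: String)

    val spark = SparkSession.builder().appName("join-sketch").getOrCreate()
    import spark.implicits._

    // Toy data; the thread's actual nested structure is much larger.
    val events = Seq(Event(1L, Seq("a", "b")), Event(2L, Seq("c"))).toDS()
    val meta   = Seq(Meta(1L, "x"), Meta(2L, "y")).toDS()

    // Dataset join: nested fields are encoded into Spark's internal rows and
    // shuffled wholesale, which can inflate shuffle write size.
    val joinedDs = events.joinWith(meta, events("id") === meta("id"))

    // RDD alternative: key the plain JVM objects by id and join those instead.
    val joinedRdd = events.rdd.keyBy(_.id).join(meta.rdd.keyBy(_.id)).values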

Re: createDataFrame causing a strange error.

2016-11-27 Thread Marco Mistroni
Hi, pickle errors normally point to a serialisation issue. I am suspecting something wrong with your S3 data, but it is just a wild guess... Is your S3 object publicly available? A few suggestions to nail down the problem: 1 - try to see if you can read your object from S3 using the boto3 library 'offline',

createDataFrame causing a strange error.

2016-11-27 Thread Andrew Holway
Hi, Can anyone tell me what is causing this error? Spark 2.0.0, Python 2.7.5. df = sqlContext.createDataFrame(foo, schema) https://gist.github.com/mooperd/368e3453c29694c8b2c038d6b7b4413a Traceback (most recent call last): File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",

Re: createDataFrame causing a strange error.

2016-11-27 Thread Andrew Holway
I get a slightly different error when not specifying a schema: Traceback (most recent call last): File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py", line 61, in df = sqlContext.createDataFrame(foo) File

how to print auc & prc for GBTClassifier, which is okay for RandomForestClassifier

2016-11-27 Thread Zhiliang Zhu
Hi All, I need to print AUC and PRC for a GBTClassifier model. It seems okay for RandomForestClassifier but not for GBTClassifier, though the rawPrediction column is not in the original data in either case. The code is: ..    // Set up Pipeline    val stages = new
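In the Spark versions current at the time of this thread, GBTClassificationModel does not add a rawPrediction column, which is why BinaryClassificationEvaluator works for RandomForestClassifier but not here. One hedged workaround sketch is to fall back to mllib's BinaryClassificationMetrics on the hard predictions; `predictions` below is assumed to be the output of the fitted pipeline's transform() with the default "prediction" and "label" column names, and because the scores are only 0/1 the resulting curves are coarse.

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // Assumed: `predictions` is the DataFrame returned by pipelineModel.transform(test),
    // using the default "prediction" and "label" column names.
    val scoreAndLabels = predictions.select("prediction", "label").rdd
      .map(r => (r.getDouble(0), r.getDouble(1)))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"AUC = ${metrics.areaUnderROC()}")
    println(s"PRC = ${metrics.areaUnderPR()}")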

Re: Third party library

2016-11-27 Thread Steve Loughran
On 27 Nov 2016, at 02:55, kant kodali wrote: I would say instead of LD_LIBRARY_PATH you might want to use java.library.path in the following way: java -Djava.library.path=/path/to/my/library, or pass java.library.path along with spark-submit
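As a hedged aside, Spark also exposes configuration keys for the JVM library path; the sketch below only names them (the path is a placeholder). In practice these are usually supplied on the spark-submit command line with --conf, since driver-side JVM options cannot be changed after the driver JVM has already started.

    import org.apache.spark.SparkConf

    // Placeholder path; these keys set a special library path when launching
    // the driver and executor JVMs respectively.
    val conf = new SparkConf()
      .set("spark.driver.extraLibraryPath", "/path/to/my/library")
      .set("spark.executor.extraLibraryPath", "/path/to/my/library")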

[Spark R]: Does Spark R support nonlinear optimization with nonlinear constraints?

2016-11-27 Thread himanshu.gpt
Hi, Component: Spark R. Level: Beginner. Scenario: Does Spark R support nonlinear optimization with nonlinear constraints? Our business application supports two types of functions, convex and S-shaped curves, and linear & non-linear constraints. These constraints can be combined with any one type

Re: Spark ignoring partition names without equals (=) separator

2016-11-27 Thread Bharath Bhushan
Prasanna, AFAIK Spark does not handle folders without partition column names in them, and there is no way to get Spark to do it. I think the reason for this is that Parquet file hierarchies carried this info and historically Spark deals more with those. On Mon, Nov 28, 2016 at 9:48 AM, Prasanna

Spark ignoring partition names without equals (=) separator

2016-11-27 Thread Prasanna Santhanam
I've been toying around with Spark SQL lately and trying to move some workloads from Hive. In the Hive world, the partitions below are recovered by an ALTER TABLE RECOVER PARTITIONS. *Path:* s3://bucket-company/path/2016/03/11 s3://bucket-company/path/2016/03/12 s3://bucket-company/path/2016/03/13
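Since Spark's partition discovery expects key=value directory names, one manual workaround sketch (assuming the data is Parquet and a Spark 2.x session; the column names are illustrative) is to read each directory separately and attach the partition values as literal columns:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("manual-partitions").getOrCreate()

    // Directories from the question, which lack the year=/month=/day= naming
    // that automatic partition discovery expects.
    val days = Seq(11, 12, 13)

    val df = days.map { d =>
        spark.read.parquet(f"s3://bucket-company/path/2016/03/$d%02d")
          .withColumn("year", lit(2016))
          .withColumn("month", lit(3))
          .withColumn("day", lit(d))
      }.reduce(_ union _)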

if conditions

2016-11-27 Thread Hitesh Goyal
Hi team, I am using Apache Spark version 1.6.1 and writing Spark SQL queries with it. I found two ways of writing SQL queries: one is plain SQL syntax and the other is Spark DataFrame functions. I need to execute if conditions using DataFrame functions. Please specify how I can do

Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger data sizes due to the overhead of opening up files. Typical guidance is between 64MB and 1GB; personally I usually stick with 128MB-512MB with the default snappy codec compression for Parquet. A good reference is Vida Ha's presentation Data Storage

Re: if conditions

2016-11-27 Thread Stuart White
Use the when() and otherwise() functions. For example: import org.apache.spark.sql.functions._ val rows = Seq(("bob", 1), ("lucy", 2), ("pat", 3)).toDF("name", "genderCode") rows.show +----+----------+ |name|genderCode| +----+----------+ | bob|         1| |lucy|         2| | pat|         3|
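The example above is truncated; a fuller sketch of the same when()/otherwise() pattern follows (the gender labels are illustrative, not from the original reply):

    import org.apache.spark.sql.functions._
    // In an application, also: import spark.implicits._ (for toDF and $)

    val rows = Seq(("bob", 1), ("lucy", 2), ("pat", 3)).toDF("name", "genderCode")

    // when()/otherwise() is the DataFrame analogue of an if / else-if / else chain.
    val withGender = rows.withColumn("gender",
      when($"genderCode" === 1, "male")
        .when($"genderCode" === 2, "female")
        .otherwise("unknown"))

    withGender.show()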

RE: if conditions

2016-11-27 Thread Hitesh Goyal
I tried this, but it is throwing an error that the method "when" is not applicable. I am doing this in Java instead of Scala. Note: I am using Spark version 1.6.1. -----Original Message----- From: Stuart White [mailto:stuart.whi...@gmail.com] Sent: Monday, November 28, 2016 10:26 AM To: Hitesh

Spark app write too many small parquet files

2016-11-27 Thread Kevin Tran
Hi Everyone, Does anyone know the best practice for writing Parquet files from Spark? When my Spark app writes data to Parquet, under the output directory there are heaps of very small Parquet files (such as e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each Parquet file is only 15KB
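A common remedy, sketched below under the assumption that `df` is the DataFrame being written, is to cut the number of output partitions before the write so each file lands near the 128MB-512MB range suggested in Denny's reply above; the partition count and output path are placeholders to tune against your data volume.

    // Placeholder partition count and path; pick the count so each output file
    // ends up roughly in the 128MB-512MB range.
    val numOutputFiles = 16

    df.coalesce(numOutputFiles)
      .write
      .mode("overwrite")
      .parquet("s3://some-bucket/output/")

repartition(numOutputFiles) can be used instead of coalesce() when a full shuffle is acceptable and more evenly sized files are wanted.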