Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Thanks Krishna for your response. Features in the training set have more categories than the test set, so when VectorAssembler is used these numbers are usually different, and I believe that is expected, right? The test dataset usually will not have as many categories in its features as Train is the
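The mismatch this thread is chasing can be illustrated without Spark at all. Below is a plain-Scala sketch (all names made up): a category index fitted separately on train and test yields one-hot vectors of different widths, which is exactly what makes the model reject the test vectors. The fix in Spark ML is to fit the StringIndexer/OneHotEncoder stages on the training data only and reuse the fitted pipeline to transform the test data.

```scala
// Hypothetical stand-in for StringIndexer + OneHotEncoder (no Spark needed).
def fitIndex(values: Seq[String]): Map[String, Int] =
  values.distinct.zipWithIndex.toMap

def oneHot(value: String, index: Map[String, Int]): Array[Double] = {
  val v = Array.fill(index.size)(0.0)
  index.get(value).foreach(i => v(i) = 1.0)
  v
}

val trainCats = Seq("US", "UK", "DE", "FR") // 4 categories in training data
val testCats  = Seq("US", "UK")             // only 2 appear in test data

// Fitting per dataset (the mistake): widths disagree.
val trainWidth = oneHot("US", fitIndex(trainCats)).length
val testWidth  = oneHot("US", fitIndex(testCats)).length

// Reusing the train-fitted index (the fix): widths agree.
val fixedWidth = oneHot("US", fitIndex(trainCats)).length
```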

Re: Entire XML data as one of the column in DataFrame

2016-08-21 Thread Hyukjin Kwon
I can't say this is the best way to do so, but my instant thought is as below: create two DataFrames. sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"") sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"") sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8") val strXmlDf =
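The archive has stripped the tag literals out of the `s""` interpolations above. With spark-xml they would be the opening and closing forms of your row element; using a hypothetical `<book>` row tag for illustration, the configuration would look like:

```scala
// Hypothetical reconstruction: <book> stands in for whatever the actual
// row element of the XML document is.
import com.databricks.spark.xml.XmlInputFormat

sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, "<book>")
sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, "</book>")
sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")
```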

Hi,

2016-08-21 Thread Xi Shen
I found there are several .conf files in the conf directory. Which one is used as the default when I click the "new" button on the notebook homepage? I want to edit the default profile configuration so all my notebooks are created with custom settings. -- Thanks, David S.

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Just after I sent the mail, I realized that the error might be with the training dataset, not the test dataset. 1. It might be that you are feeding the full Y vector for training. 2. Which could mean you are using a ~50-50 training-test split. 3. Take a good look at the code that

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Looks like the test dataset has different sizes for X & Y. Possible steps: 1. What is the test data size? If it is 15,909, check the prediction variable vector: it is now 29,471 but should be 15,909. If you expect it to be 29,471, then the X matrix is not right.

Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Hi, I have built a logistic regression model using a training dataset. When I predict on a test dataset, it throws the size-mismatch error below. Steps done: 1. String indexers on categorical features. 2. One-hot encoding on these indexed features. Any help is appreciated to

Re: submitting spark job with kerberized Hadoop issue

2016-08-21 Thread Aneela Saleem
Any update on this? On Tuesday, 16 August 2016, Aneela Saleem wrote: > Thanks Steve, > > I went through this but still not able to fix the issue > > On Mon, Aug 15, 2016 at 2:01 AM, Steve Loughran

Re: Accessing HBase through Spark with Security enabled

2016-08-21 Thread Aneela Saleem
Any update on this? On Tuesday, 16 August 2016, Aneela Saleem wrote: > Thanks Steve, > > I have gone through it's documentation, i did not get any idea how to > install it. Can you help me? > > On Mon, Aug 15, 2016 at 4:23 PM, Steve Loughran

RE: Flattening XML in a DataFrame

2016-08-21 Thread srikanth.jella
Hi Hyukjin, I have created the below issue. https://github.com/databricks/spark-xml/issues/155 Sent from Mail for Windows 10 From: Hyukjin Kwon

Entire XML data as one of the column in DataFrame

2016-08-21 Thread srikanth.jella
Hello Experts, I’m using the spark-xml package, which is automatically inferring my schema and creating a DataFrame. I’m extracting a few fields like id and name (which are unique) from the XML below, but my requirement is to store the entire XML in one of the columns as well. I’m writing this data to AVRO

Re: Spark Streaming application failing with Token issue

2016-08-21 Thread Mich Talebzadeh
Hi Kamesh, The message you are getting after 7 days: PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache. hadoop.security.token.SecretManager$InvalidToken): Token has expired Sounds like an IPC issue with Kerberos

Re: How to continuous update or refresh RandomForestClassificationModel

2016-08-21 Thread Jacek Laskowski
Hi, That's my understanding -- you need to fit another model given the training data. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Aug 19, 2016 at

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Everett Anderson
On Sun, Aug 21, 2016 at 3:08 AM, Bedrytski Aliaksandr wrote: > Hi, > > we share the same spark/hive context between tests (executed in > parallel), so the main problem is that the temporary tables are > overwritten each time they are created, this may create race conditions >

Re: Reporting errors from spark sql

2016-08-21 Thread Jacek Laskowski
Hi, See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala#L65 to learn how Spark SQL parses SQL texts. It could give you a way out. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache

Re: Unsubscribe

2016-08-21 Thread Rahul Palamuttam
Hi Sudhanshu, Try user-unsubscribe@spark.apache.org - Rahul P Sent from my iPhone > On Aug 21, 2016, at 9:19 AM, Sudhanshu Janghel > wrote: > > Hello, > > I wish to unsubscribe from the channel. > > KIND REGARDS, > SUDHANSHU

Unsubscribe

2016-08-21 Thread Sudhanshu Janghel
Hello, I wish to unsubscribe from the channel. KIND REGARDS, SUDHANSHU

Re: Spark Streaming application failing with Token issue

2016-08-21 Thread Jacek Laskowski
Hi Kamesh, I believe your only option is to re-start your application every 7 days (perhaps you need to enable checkpointing). See https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d198cb94ffa for a change with automatic security token renewal. Pozdrawiam, Jacek Laskowski
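Besides restarting, a commonly used alternative on YARN is to hand Spark a keytab so it can re-obtain delegation tokens itself instead of failing when the 7-day maximum token lifetime is hit. A hedged sketch of the submit command (principal, paths, and class names below are placeholders, not from the thread):

```shell
# With --principal/--keytab, the YARN backend periodically re-logs in from
# the keytab and refreshes delegation tokens for long-running apps.
spark-submit \
  --master yarn \
  --principal sys_bio_replicator@EXAMPLE.COM \
  --keytab /etc/security/keytabs/replicator.keytab \
  --class com.example.StreamingApp \
  app.jar
```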

Re: Best way to read XML data from RDD

2016-08-21 Thread Darin McBeath
Another option would be to look at spark-xml-utils. We use this extensively in the manipulation of our XML content. https://github.com/elsevierlabs-os/spark-xml-utils There are quite a few examples. Depending on your preference (and what you want to do), you could use xpath, xquery, or

Re: Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Sean Owen
You are attempting to read a tar file. That won't work. A compressed JSON file would. On Sun, Aug 21, 2016, 12:52 Chua Jie Sheng wrote: > Hi Spark user list! > > I have been encountering corrupted records when reading Gzipped files that > contains more than one file. > >
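Sean's point can be demonstrated in a shell: a tar archive wrapped in gzip decompresses to tar headers plus file contents, not to JSON, whereas concatenated gzip members decompress to the plain concatenation of the original files, which is exactly the line-per-record input `sqlContext.read.json` expects.

```shell
# Two one-record JSON-lines files.
printf '{"id":1}\n' > a.json
printf '{"id":2}\n' > b.json

# What the thread did (roughly): tar + gzip. Decompressing this yields
# tar header bytes around the content -- corrupt records to a JSON reader.
tar -czf bad.json.gz a.json b.json

# What works: gzip each file and concatenate the members. Multi-member
# gzip decompresses to the concatenation of the members.
gzip -c a.json > good.json.gz
gzip -c b.json >> good.json.gz

gunzip -c good.json.gz   # two clean JSON lines
```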

Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Chua Jie Sheng
Hi Spark user list! I have been encountering corrupted records when reading Gzipped files that contain more than one file. Example: I have two .json files, [a.json, b.json]. Each has multiple records (one line, one record). I tarred both of them together on Mac OS X 10.11.6, bsdtar 2.8.3 -

Re: Best way to read XML data from RDD

2016-08-21 Thread Hyukjin Kwon
Hi Diwakar, the Spark XML library can take an RDD as source. ``` val df = new XmlReader() .withRowTag("book") .xmlRdd(sqlContext, rdd) ``` If performance is critical, I would also recommend taking care of the creation and destruction of the parser. If the parser is not serializable, then you can do
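The per-partition parser pattern the reply is alluding to can be sketched without Spark. The stand-in `mapPartitions` below mimics `rdd.mapPartitions`: the (non-serializable) parser is built once per partition and reused for every record, instead of once per record. The JDK `DocumentBuilder` plays the role of the XML parser; all record contents are made up.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

// Plain-Scala stand-in for rdd.mapPartitions.
def mapPartitions[A, B](partitions: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[B] =
  partitions.flatMap(p => f(p.iterator))

val partitions = Seq(
  Seq("<book><id>1</id></book>", "<book><id>2</id></book>"),
  Seq("<book><id>3</id></book>")
)

val ids = mapPartitions(partitions) { records =>
  // One parser per partition: created once here, reused for each record.
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  records.map { xml =>
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    doc.getElementsByTagName("id").item(0).getTextContent
  }
}
```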

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Bedrytski Aliaksandr
Hi, we share the same Spark/Hive context between tests (executed in parallel), so the main problem is that the temporary tables are overwritten each time they are created. This may create race conditions, as these temp tables may be seen as global mutable shared state. So each time we create a
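One common workaround for the clash described above is to stop sharing a fixed table name at all: give each test a unique name and pass it through the test. A minimal sketch in plain Scala (in Spark, the generated name would be what you hand to `df.createOrReplaceTempView`):

```scala
import java.util.UUID

// Each call yields a fresh, SQL-identifier-safe table name, so parallel
// tests registering "users" no longer overwrite each other's temp view.
def uniqueTableName(base: String): String =
  s"${base}_${UUID.randomUUID().toString.replace("-", "")}"

val t1 = uniqueTableName("users")
val t2 = uniqueTableName("users")
```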

DCOS - s3

2016-08-21 Thread Martin Somers
I'm having trouble loading data from an s3 repo. Currently DCOS is running Spark 2, so I'm not sure if there is a modification to the code with the upgrade. My code at the moment looks like this: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx") sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")
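One thing worth checking on a Spark 2 build: the old `s3n` connector is deprecated in newer Hadoop versions, and its keys are ignored by the maintained `s3a` connector. A hedged config sketch (requires `hadoop-aws` and the matching AWS SDK on the classpath; bucket name and key values are placeholders):

```scala
// s3a equivalents of the s3n keys above.
sc.hadoopConfiguration.set("fs.s3a.access.key", "xxx")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xxx")

// Hypothetical read; note the s3a:// scheme rather than s3n://.
val df = spark.read.text("s3a://my-bucket/path/")
```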