Re: Apply ML to grouped dataframe

2016-08-22 Thread Nirmal Fernando
On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu wrote: > We can group a dataframe by one column like > > df.groupBy(df.col("gender")) > On top of this DF, use a filter to extract each group as a separate DF. Then you can apply ML on top of each DF, e.g.:
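[Editor's note: the example is truncated in the archive. A minimal sketch of the filter-per-group idea, assuming df has a string "gender" column plus the "label"/"features" columns expected by spark.ml; the estimator is only illustrative:]

import org.apache.spark.ml.classification.LogisticRegression

// Collect the distinct group keys, then fit one model per filtered slice.
val genders = df.select("gender").distinct().collect().map(_.getString(0))
val modelsByGender = genders.map { g =>
  val slice = df.filter(df("gender") === g)
  (g, new LogisticRegression().fit(slice))
}.toMap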

Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
We can group a dataframe by one column like df.groupBy(df.col("gender")). This is like splitting a dataframe into multiple dataframes. Currently, we can only apply simple SQL functions to this GroupedData, like agg, max etc. What we want is to apply one ML algorithm to each group. Regards. From: Nirmal

Re: Apply ML to grouped dataframe

2016-08-22 Thread Nirmal Fernando
Hi Wen, AFAIK Spark MLlib implements its machine learning algorithms on top of the Spark dataframe API. What did you mean by a grouped dataframe? On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu wrote: > Hi Nirmal > > I didn't get your point. > Can you tell me more about how to use

Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
Hi Nirmal, I didn't get your point. Can you tell me more about how to use MLlib with a grouped dataframe? Regards. Wenpei. From: Nirmal Fernando To: Wen Pei Yu/China/IBM@IBMCN Cc: User Date: 08/23/2016 10:26 AM Subject: Re: Apply ML to

Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Deepak Sharma
Hi Subhajit, Try this in your join: val df = sales_demand.join(product_master, sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"), "inner") On Tue, Aug 23, 2016 at 2:30 AM, Subhajit Purkayastha wrote: > All, > I

Re: Apply ML to grouped dataframe

2016-08-22 Thread Nirmal Fernando
You can use Spark MLlib http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu wrote: > Hi > > We have a dataframe, then want group it and apply a ML algorithm or > statistics(say t test)

Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
Hi, We have a dataframe, and want to group it and apply an ML algorithm or statistics (say a t-test) to each group. Is there any efficient way for this situation? Currently, we transfer to pyspark, use groupByKey and apply a numpy function to the array. But this isn't an efficient way, right? Regards.

Re: Spark with Parquet

2016-08-22 Thread shamu
Create a Hive table x. Load your CSV data into table x (LOAD DATA INPATH 'file/path' INTO TABLE x;). Create a Hive table y with the same structure as x, except add STORED AS PARQUET. Then run INSERT OVERWRITE TABLE y SELECT * FROM x; This would get you Parquet files under /user/hive/warehouse/y (as an example).
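[Editor's note: a sketch of the same conversion driven from Spark rather than the Hive CLI, assuming a Hive-enabled SparkSession; the column list and input path are placeholders:]

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-parquet")
  .enableHiveSupport()
  .getOrCreate()

// Same four steps as above, issued through Spark SQL.
spark.sql("CREATE TABLE x (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
spark.sql("LOAD DATA INPATH '/path/to/data.csv' INTO TABLE x")
spark.sql("CREATE TABLE y (id INT, name STRING) STORED AS PARQUET")
spark.sql("INSERT OVERWRITE TABLE y SELECT * FROM x")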

Re: word count on parquet file

2016-08-22 Thread shamu
I changed the code to the below... JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(inputFile, ParquetInputFormat.class, NullWritable.class, String.class, mrConf); JavaRDD<String> words = rdd.values().flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String

DataFrameWriter bug after RandomSplit?

2016-08-22 Thread evanzamir
Trying to build an ML model using LogisticRegression, I ran into the following unexplainable issue. Here's a snippet of the code: training, testing = data.randomSplit([0.8, 0.2], seed=42) print("number of rows in testing = {}".format(testing.count()))
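[Editor's note: the snippet is truncated here. A commonly suggested mitigation for unstable splits (not a confirmed fix for this particular report) is to cache the parent DataFrame before splitting, so both splits derive from the same materialized data; sketched in Scala:]

val Array(training, testing) = data.cache().randomSplit(Array(0.8, 0.2), seed = 42L)
println(s"number of rows in testing = ${testing.count()}")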

Combining multiple models in Spark-ML 2.0

2016-08-22 Thread janardhan shetty
Hi, Are there any pointers or links on stacking multiple models with Spark dataframes? What strategies can be employed if we need to combine more than 2 models?

Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Vishal Maru
Try putting the join condition as a String. On Mon, Aug 22, 2016 at 5:00 PM, Subhajit Purkayastha wrote: > All, > > I have the following dataFrames and the temp table. > > I am trying to create a new DF, the following statement is not compiling > > val df =
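[Editor's note: one reading of this suggestion, sketched against the dataframes from the original question - join on the shared column name (or a Seq of names) instead of a Column expression:]

// Join on the common column name; also avoids the ambiguous-column problem.
val df = sales_demand.join(product_master, Seq("INVENTORY_ITEM_ID"), "inner")

// Equivalent with an explicit Column condition:
val df2 = sales_demand.join(product_master,
  sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"),
  "inner")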

Re: word count on parquet file

2016-08-22 Thread ayan guha
You are missing the input. mrConf is not the way to add input files. In Spark, try the DataFrame read functions or the sc.textFile function. Best, Ayan On 23 Aug 2016 07:12, "shamu" wrote: > Hi All, > I am a newbie to Spark/Hadoop. > I want to read a parquet file and perform a
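[Editor's note: a Scala sketch of the DataFrame route being suggested; the "text" column name is a placeholder, since the Parquet schema isn't shown in the thread:]

val df = spark.read.parquet(inputFile)
val counts = df.select("text").rdd                  // "text" is a hypothetical column
  .flatMap(_.getString(0).split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(20).foreach(println)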

word count on parquet file

2016-08-22 Thread shamu
Hi All, I am a newbie to Spark/Hadoop. I want to read a parquet file and perform a simple word-count. Below is my code; however, I get an error: Exception in thread "main" java.io.IOException: No input paths specified in job at

Spark 2.0 - Join statement compile error

2016-08-22 Thread Subhajit Purkayastha
All, I have the following dataFrames and the temp table. I am trying to create a new DF; the following statement is not compiling: val df = sales_demand.join(product_master, (sales_demand.INVENTORY_ITEM_ID == product_master.INVENTORY_ITEM_ID), joinType="inner") What am I

Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-22 Thread Felix Cheung
How big is the output from score()? Also could you elaborate on what you want to broadcast? On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" > wrote: Hello, I am using the new R API in SparkR spark.lapply

spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-22 Thread Cinquegrana, Piero
Hello, I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a complex function to be run across executors and I have to send the entire dataset, but there is not (that I could find) a way to broadcast the variable in SparkR. I am thus reading the dataset in each executor

Re: Do we still need to use Kryo serializer in Spark 1.6.2 ?

2016-08-22 Thread Jacek Laskowski
Hi, I haven't heard this. Moreover, I see Kryo supported in Encoders (SerDes) in Spark 2.0. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala#L151 Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache
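[Editor's note: a small sketch of the Encoders.kryo support being referenced; the Sensor class is a hypothetical example type:]

import org.apache.spark.sql.{Encoder, Encoders}

case class Sensor(id: String, reading: Double)    // hypothetical type

// Kryo-backed encoder for Datasets of Sensor.
implicit val sensorEncoder: Encoder[Sensor] = Encoders.kryo[Sensor]
// val ds = spark.createDataset(Seq(Sensor("a", 1.0)))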

Do we still need to use Kryo serializer in Spark 1.6.2 ?

2016-08-22 Thread Eric Ho
I heard that Kryo will get phased out at some point, but I'm not sure in which Spark release. I'm using PySpark; does anyone have any docs on how to call/use the Kryo serializer in PySpark? Thanks. -- -eric ho

Re: Disable logger in SparkR

2016-08-22 Thread Felix Cheung
You should be able to do that with log4j.properties: http://spark.apache.org/docs/latest/configuration.html#configuring-logging Or programmatically: https://spark.apache.org/docs/2.0.0/api/R/setLogLevel.html From: Yogesh Vyas

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread janardhan shetty
Thanks Nick. This JIRA seems to have been in a stagnant state for a while; any update on when this will be released? On Mon, Aug 22, 2016 at 5:07 AM, Nick Pentreath wrote: > I believe it may be because of this issue (https://issues.apache.org/jira/browse/SPARK-13030). OHE is

[pyspark] How to ensure rdd.takeSample produce the same set everytime?

2016-08-22 Thread Chua Jie Sheng
Hi all, I have been trying on different machines to make rdd.takeSample produce the same set, but failed. I have seeded the method with the same value on different machines, but the results differ. Any idea why? Best Regards, Jie Sheng
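[Editor's note: a hedged sketch, not a confirmed diagnosis of this report - fixing the seed pins the sampler, but the result can still differ if the RDD's element order or partitioning differs between machines; making the input order deterministic first is one possible workaround:]

// Sort (or otherwise deterministically order/partition) before sampling,
// so the same seed sees the same input layout on every machine.
val stable = rdd.sortBy(identity)                  // assumes an Ordering on the element type
val sample = stable.takeSample(withReplacement = false, num = 100, seed = 42L)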

Re: Spark 2.0 regression when querying very wide data frames

2016-08-22 Thread mhornbech
I don't think that's the issue. It sounds very much like this: https://issues.apache.org/jira/browse/SPARK-16664 Morten > On 20 Aug 2016 at 21:24, ponkin [via Apache Spark User List] > wrote: > > Did you try to load a wide, for example, CSV file or

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
Below is the source code for parsing an XML RDD which has single-line XML data. import scala.xml.XML import scala.xml.Elem import scala.collection.mutable.ArrayBuffer import scala.xml.Text import scala.xml.Node var dataArray = new ArrayBuffer[String]() def processNode(node:
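[Editor's note: the snippet is cut off in the archive. A self-contained sketch of the same single-line-XML approach; the path and the <record>/<name> element names are placeholders:]

import scala.xml.XML

// One complete XML document per line, e.g. <record><name>abc</name></record>
val xmlRdd = sc.textFile("records.xml")            // hypothetical path
val names = xmlRdd.map { line =>
  val doc = XML.loadString(line)
  (doc \ "name").text
}
names.take(5).foreach(println)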

mutable.LinkedHashMap kryo serialization issues

2016-08-22 Thread Rahul Palamuttam
Hi, Just sending this again to see if others have had this issue. I recently switched to using kryo serialization and I've been running into errors with the mutable.LinkedHashMap class. If I don't register the mutable.LinkedHashMap class then I get an ArrayStoreException seen below. If I do
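[Editor's note: for reference, a sketch of the generic Kryo registration step being discussed; per the (truncated) message, registering may still lead to a different error, so this shows the mechanism rather than a confirmed fix:]

import org.apache.spark.SparkConf
import scala.collection.mutable

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[mutable.LinkedHashMap[String, String]]))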

UDTRegistration (in Java)

2016-08-22 Thread raghukiran
Hi, After moving to Spark 2.0, the UDTRegistration is giving me some issues. I am trying the following (in Java): UDTRegistration.register(userclassName, udtclassName); After this, when I try creating a DataFrame, it throws an exception that the userclassName is not registered. Can anyone point

Using spark to distribute jobs to standalone servers

2016-08-22 Thread Larry White
Hi, I have a bit of an unusual use-case and would greatly appreciate some feedback as to whether it is a good fit for Spark. I have a network of compute/data servers configured as a tree as shown below: - controller - server 1 - server 2 - server 3 - etc. There are

Re: Best way to read XML data from RDD

2016-08-22 Thread Darin McBeath
Yes, you can use it for single-line XML or even multi-line XML. In our typical mode of operation, we have sequence files (where the value is the XML). We then run operations over the XML to extract certain values or to transform the XML into another format (such as JSON). If I understand your
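[Editor's note: a small sketch of the sequence-file setup described here (key = document id, value = the XML as a string); the path and the <title> element are placeholders:]

import scala.xml.XML

val xmlPairs = sc.sequenceFile[String, String]("/data/xml-seq")
val titles = xmlPairs.mapValues(xml => (XML.loadString(xml) \ "title").text)
titles.take(5).foreach(println)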

Disable logger in SparkR

2016-08-22 Thread Yogesh Vyas
Hi, Is there any way of disabling the logging on console in SparkR ? Regards, Yogesh - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-22 Thread Ben Teeuwen
I don’t think scaling RAM is a sane strategy for fixing these problems with using a dataframe/transformer approach to creating large sparse vectors. One, though it will delay the point at which it fails, it will still fail. In the original case I emailed about I tried this, and after waiting 50

Avoid RDD shuffling in a join after Distributed Matrix operation

2016-08-22 Thread Tharindu
Hi, Just wanted to get your input on how to avoid RDD shuffling in a join after a distributed matrix operation in Spark. The following is what my app would look like: 1. Created a dense matrix as an input to calculate the cosine distance between columns: val rowMarixIn = sc.textFile("input.csv").map{ line
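[Editor's note: a sketch of the distributed cosine-similarity step being set up here, using MLlib's RowMatrix; the CSV is assumed to be purely numeric:]

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.textFile("input.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val mat = new RowMatrix(rows)
// Upper-triangular CoordinateMatrix of pairwise cosine similarities between columns.
val sims = mat.columnSimilarities()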

Fwd: How to avoid RDD shuffling in join after Distributed Matrix calculation

2016-08-22 Thread Tharindu Thundeniya
Hi, Just wanted to get your input on how to avoid RDD shuffling in a join after a distributed matrix operation in Spark. The following is what my app would look like: 1. Created a dense matrix as an input to calculate the cosine distance between columns: val rowMarixIn = sc.textFile("input.csv").map{ line =>

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread Nick Pentreath
I believe it may be because of this issue (https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator - hence in cases where the number of categories differs between train and test, it's not usable in its current form. It's tricky to work around, though one option is to use

RE: Best way to read XML data from RDD

2016-08-22 Thread Puneet Tripathi
I was building a small app to stream messages from Kafka via Spark. Each message was an XML document. I wrote a simple app to do so [this app expects the XML to be a single line]: from __future__ import print_function from pyspark.sql import Row import xml.etree.ElementTree as

Re: Populating tables using hive and spark

2016-08-22 Thread Mich Talebzadeh
OK. This is my test: 1) Create a table in Hive and populate it with two rows: hive> create table testme (col1 int, col2 string); OK hive> insert into testme values (1,'London'); Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a OK hive> insert into testme values (2,'NY'); Query ID

Re: Best way to read XML data from RDD

2016-08-22 Thread Hyukjin Kwon
Do you mind sharing your code and sample data? It should be okay with single-line XML if I remember correctly. 2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi: > Hi Darin, > > Are you using this utility to parse single-line XML? > > > Sent from Samsung Mobile.

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
Hi Darin, Are you using this utility to parse single-line XML? Sent from Samsung Mobile. Original message From: Darin McBeath Date: 21/08/2016 17:44 (GMT+05:30) To: Hyukjin Kwon, Jörn Franke Cc:

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
Hi Franke, The source format cannot be changed as of now, and it is a pretty standard format that has been working for years. Yeah, creating one parser is something I can try out. Sent from Samsung Mobile. Original message From: Jörn Franke Date: 20/08/2016 11:40

Re: Best way to read XML data from RDD

2016-08-22 Thread Diwakar Dhanuskodi
Hi Kwon, I was trying out the spark-xml library. I keep on getting errors in inferring the schema. It looks like it cannot infer single-line XML data. Sent from Samsung Mobile. Original message From: Hyukjin Kwon Date: 21/08/2016 15:40 (GMT+05:30) To: Jörn

Re: Populating tables using hive and spark

2016-08-22 Thread Nitin Kumar
Hi Furcy, If I execute the command "ANALYZE TABLE TEST_ORC COMPUTE STATISTICS" before checking the count from Hive, Hive returns the correct count, although it does not spawn a map-reduce job for computing the count. I'm running an HDP 2.4 cluster with Hive 1.2.1.2.4 and Spark 1.6.1. If others can

updateStateByKey for window batching

2016-08-22 Thread Dávid Szakállas
Hi! I’m curious about the fault-tolerance properties of stateful streaming operations. I am specifically interested in updateStateByKey. What happens if a node fails during processing? Is the state recoverable? Our use case is the following: we have messages arriving from a message queue
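[Editor's note: for context, a minimal sketch of the checkpoint-backed pattern the question refers to - with a checkpoint directory set, Spark Streaming persists the per-key state so it can be recovered after a failure; the directory, batch interval, and update function below are placeholders:]

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/app")          // required for stateful operations

def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

// messages: DStream[(String, Int)] built from the message queue (not shown here)
// val runningCounts = messages.updateStateByKey(updateFunc)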

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-22 Thread Bedrytski Aliaksandr
Hi Everett, The HiveContext is initialized only once as a lazy val, so if you mean initializing different JVMs for each test (or group of tests), then in that case the context will not, obviously, be shared. But specs2 (by default) launches specs (inside of test classes) in parallel threads and in

Populating tables using hive and spark

2016-08-22 Thread Nitin Kumar
Hi! I've noticed that Hive has problems registering new data records if the same table is written to using both the Hive terminal and Spark SQL. The problem is demonstrated through the commands listed below: hive> use

Re: [Spark2] Error writing "complex" type to CSV

2016-08-22 Thread Hyukjin Kwon
Whether it writes the data as garbage or as a string representation, it cannot be loaded back. So, I'd say both are wrong and are bugs. I think it'd be great if we could write and read back CSV in its own format, but I guess we can't for now. 2016-08-20 2:54 GMT+09:00 Efe Selcuk: