Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-20 Thread Efe Selcuk
Thanks for the response. What do you mean by "semantically" the same? They're both Datasets of the same type, which is a case class, so I would expect compile-time integrity of the data. Is there a situation where this wouldn't be the case? Interestingly enough, if I instead create an empty rdd

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-20 Thread Agraj Mangal
I believe this normally happens when Spark is unable to perform the union due to a "difference" in the schemas of the operands. Can you check whether the schemas of both datasets are semantically the same? On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk wrote: > Bump! > > On Thu, Oct 13, 2016 at
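One quick check is to print both schemas before the union, and to build the empty side from the same case class so it is guaranteed to carry the same schema (union in Spark 2.0 resolves columns by position, so field order and nullability differences surface as this error). A minimal sketch; SomeCase, ds and spark are placeholder names, not from the thread:

    import spark.implicits._

    case class SomeCase(id: Long, name: String)

    // An empty Dataset derived from the same case class keeps the same
    // schema and nullability as the populated one.
    val emptyDs = spark.emptyDataset[SomeCase]

    // Compare what union() actually sees on both sides before combining
    emptyDs.printSchema()
    ds.printSchema()          // ds: Dataset[SomeCase] holding real data

    val combined = emptyDs.union(ds)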

Re: spark pi example fail on yarn

2016-10-20 Thread Li Li
I set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml and it now works for both yarn-client mode and spark-shell. On Fri, Oct 21, 2016 at 10:59 AM, Li Li wrote: > I found a warn in nodemanager log. is the virtual memory exceed? how > should I config yarn to solve this
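For reference, the NodeManager-side property being described lives in yarn-site.xml (this disables the virtual-memory check entirely; raising yarn.nodemanager.vmem-pmem-ratio instead is a less drastic alternative):

    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>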

Re: spark pi example fail on yarn

2016-10-20 Thread Li Li
I found a warning in the NodeManager log. Has the virtual memory been exceeded? How should I configure YARN to solve this problem? 2016-10-21 10:41:12,588 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 20299 for container-id

Re: spark pi example fail on yarn

2016-10-20 Thread Saisai Shao
It is not that Spark has difficulty communicating with YARN; it simply means the AM exited with FINISHED state. I'm guessing it might be related to memory constraints for the container; please check the YARN RM and NM logs to find out more details. Thanks Saisai On Fri, Oct 21, 2016 at 8:14 AM, Xi Shen

Re: spark pi example fail on yarn

2016-10-20 Thread Xi Shen
16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! From this, I think Spark is having difficulty communicating with YARN. You should check your Spark log. On Fri, Oct 21, 2016 at 8:06 AM Li Li wrote: which

[Spark ML] Using GBTClassifier in OneVsRest

2016-10-20 Thread ansari
It appears as if the inheritance hierarchy doesn't allow GBTClassifiers to be used as the binary classifier in a OneVsRest trainer. Is there a simple way to use gradient-boosted trees for multiclass (not binary) problems? Specifically, it complains that GBTClassifier doesn't inherit from
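For context, OneVsRest.setClassifier expects an org.apache.spark.ml.classification.Classifier, which GBTClassifier does not extend in Spark 2.0, so the call is rejected at compile time. A sketch of the shape that does type-check, with LogisticRegression standing in purely as an illustration (train is a placeholder DataFrame with label/features columns):

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    val base = new LogisticRegression().setMaxIter(10)
    val ovr = new OneVsRest().setClassifier(base)          // LogisticRegression is a Classifier
    // new OneVsRest().setClassifier(new GBTClassifier())  // does not compile in 2.0
    val ovrModel = ovr.fit(train)

RandomForestClassifier handles multiclass natively, so it is the usual tree-based alternative when OneVsRest over GBTs is not available.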

Re: spark pi example fail on yarn

2016-10-20 Thread Li Li
which log file should I On Thu, Oct 20, 2016 at 10:02 PM, Saisai Shao wrote: > Looks like ApplicationMaster is killed by SIGTERM. > > 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM > 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status: >

Re: spark pi example fail on yarn

2016-10-20 Thread Li Li
which log file should I check? On Thu, Oct 20, 2016 at 11:32 PM, Amit Tank wrote: > I recently started learning spark so I may be completely wrong here but I > was facing similar problem with sparkpi on yarn. After changing yarn to > cluster mode it worked perfectly

Re: spark pi example fail on yarn

2016-10-20 Thread Li Li
Yes, when I use yarn-cluster mode it works. What's wrong with yarn-client? The spark shell also does not work because it runs in client mode. Any solution for this? On Thu, Oct 20, 2016 at 11:32 PM, Amit Tank wrote: > I recently started learning spark so I may be

Re: Equivalent Parquet File Repartitioning Benefits for Join/Shuffle?

2016-10-20 Thread adam kramer
I believe what I am looking for is DataFrameWriter.bucketBy, which would allow bucketing into physical parquet files by the desired columns. Then my question would be: can DataFrames/Datasets take advantage of this physical bucketing upon read of the parquet file for something like a self-join on the
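A sketch of the bucketed write being described (df is a placeholder DataFrame; the bucket count, column and table names are illustrative). Note that in Spark 2.x bucketBy only takes effect together with saveAsTable, not with a plain path-based parquet write:

    df.write
      .bucketBy(64, "id")            // hash rows into 64 buckets by "id"
      .sortBy("id")                  // optionally sort within each bucket
      .saveAsTable("events_bucketed")

    // A later self-join on the bucketing column can then avoid a full shuffle:
    val t = spark.table("events_bucketed")
    val selfJoined = t.as("a").join(t.as("b"), "id")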

Spark SQL parallelize

2016-10-20 Thread Selvam Raman
Hi, I have 40+ structured datasets stored in an S3 bucket as parquet files. I am going to use 20 tables in the use case. There is a main table which drives the whole flow. The main table contains 1k records. My use case is, for every record in the main table, process the rest of the tables (join, group by

RDD groupBy() then random sort each group ?

2016-10-20 Thread Yang
In my application, I group training samples by their model_id (the input table contains training samples for 100k different models). Each group ends up having about 1 million training samples, and I then feed that group of samples to a small logistic regression solver (SGD), but SGD
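A minimal sketch of that groupBy-then-shuffle step (Sample, modelId and samples are made-up names). Each group is materialized in memory, so roughly a million samples per group has to fit on one executor:

    import scala.util.Random

    case class Sample(modelId: Long, label: Double, features: Array[Double])

    val shuffledGroups = samples                       // samples: RDD[Sample]
      .groupBy(_.modelId)                              // one group per model
      .mapValues(g => Random.shuffle(g.toIndexedSeq))  // random order within each group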

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-20 Thread Cody Koeninger
Right on, I put in a PR to make a note of that in the docs. On Thu, Oct 20, 2016 at 12:13 PM, Srikanth wrote: > Yeah, setting those params helped. > > On Wed, Oct 19, 2016 at 1:32 PM, Cody Koeninger wrote: >> >> 60 seconds for a batch is above the

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-20 Thread Srikanth
Yeah, setting those params helped. On Wed, Oct 19, 2016 at 1:32 PM, Cody Koeninger wrote: > 60 seconds for a batch is above the default settings in kafka related > to heartbeat timeouts, so that might be related. Have you tried > tweaking session.timeout.ms,
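The parameters referred to here are ordinary Kafka consumer settings passed through kafkaParams in the 0.10 integration; a hedged sketch with illustrative values sized for batches of around 60 seconds (the broker's group.max.session.timeout.ms may also need raising to allow a larger session timeout):

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "group.id" -> "my-streaming-group",
      // keep the consumer's group membership alive across long batches
      "session.timeout.ms" -> "90000",
      "heartbeat.interval.ms" -> "30000",
      "request.timeout.ms" -> "120000"  // must stay above session.timeout.ms
    )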

Predict a single vector with the new spark.ml API to avoid groupByKey() after a flatMap()?

2016-10-20 Thread jglov
Is there a way to predict a single vector with the new spark.ml API? In my case I want to do this within a map() to avoid calling groupByKey() after a flatMap(): *Current code (pyspark):* % Given 'model', 'rdd', and a function 'split_element' that splits an element of the

Re: Mlib RandomForest (Spark 2.0) predict a single vector

2016-10-20 Thread jglov
I would also like to know if there is a way to predict a single vector with the new spark.ml API, although in my case it's because I want to do this within a map() to avoid calling groupByKey() after a flatMap(): *Current code (pyspark):* % Given 'model', 'rdd', and a function 'split_element'
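For what it is worth, the older spark.mllib tree models do expose a per-vector predict, so one hedged workaround is to train with spark.mllib and call predict inside the map (Scala shown; in pyspark the mllib model wraps a JVM object and this particular pattern is known not to work inside worker code):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // model: RandomForestModel captured in the closure (it is serializable)
    val predictions = vectors.map(v => model.predict(v))  // vectors: RDD[Vector];
                                                          // per-vector prediction, no groupByKey needed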

HashingTF for TF.IDF computation

2016-10-20 Thread Ciumac Sergiu
Hello everyone, I'm having a usage issue with the HashingTF class from Spark MLlib. I'm computing TF.IDF on a set of terms/documents, which I later use to identify the most important terms in each input document. Below is a short code snippet which outlines the example (2 documents with 2
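For reference, the usual spark.mllib TF.IDF flow over tokenized documents looks roughly like this (toy documents; the number of features is arbitrary):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val docs: RDD[Seq[String]] =
      sc.parallelize(Seq(Seq("spark", "hashing", "tf"), Seq("spark", "idf")))

    val hashingTF = new HashingTF(1 << 18)
    val tf: RDD[Vector] = hashingTF.transform(docs)   // term frequencies per document
    tf.cache()

    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)  // TF.IDF weights per document

    // hashingTF.indexOf(term) gives the bucket a term hashes to, which is the only
    // way to map terms back to vector indices (hash collisions are possible).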

Re: spark pi example fail on yarn

2016-10-20 Thread Amit Tank
I recently started learning Spark so I may be completely wrong here, but I was facing a similar problem with SparkPi on YARN. After changing to yarn-cluster mode it worked perfectly fine. Thank you, Amit On Thursday, October 20, 2016, Saisai Shao wrote: > Looks like

Re: Microbatches length

2016-10-20 Thread Paulo Candido
Thanks a lot. I'll check it. Regards. On Thu, Oct 20, 2016 at 10:50, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: You can still implement your own logic with akka actors for instance. Based on some threshold the actor can launch spark batch mode using the same spark

Re: spark pi example fail on yarn

2016-10-20 Thread Elek, Marton
Try to set the memory size limits. For example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/jars/spark-examples_2.11-2.0.0.2.5.2.0-47.jar By default yarn

Re: spark pi example fail on yarn

2016-10-20 Thread Saisai Shao
Looks like the ApplicationMaster was killed by SIGTERM. 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status: The container may have been killed by the YARN NodeManager or another process; you'd better check the YARN logs to dig out

Re: Joins of typed datasets

2016-10-20 Thread daunnc
The situation changes a bit, and the workaround is to add a `K` restriction (K should be a subtype of Product). Though right now I have another error: org.apache.spark.sql.AnalysisException: cannot resolve '(`key` = `key`)' due to data type mismatch: differing types in '(`key` = `key`)'
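That AnalysisException usually means the two `key` columns do not have the same Spark SQL type; a hedged sketch of inspecting and aligning them before the typed join (ds1/ds2 and the concrete types are illustrative):

    ds1.schema("key").dataType   // e.g. LongType
    ds2.schema("key").dataType   // e.g. IntegerType

    // Cast one side so the equality condition can resolve:
    val joined = ds1.joinWith(ds2, ds1("key") === ds2("key").cast("long"))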

Re: Microbatches length

2016-10-20 Thread vincent gromakowski
You can still implement your own logic with akka actors, for instance. Based on some threshold, the actor can launch a spark batch job using the same spark context... It's only an idea, no real experience. On Oct 20, 2016 1:31 PM, "Paulo Candido" wrote: > In this case I

Re: Spark Random Forest training cost same time on yarn as on standalone

2016-10-20 Thread Xi Shen
If you are running locally, I do not see the point of starting with 32 executors with 2 cores each. You can check the Spark web console to find out where the time is spent. Also, you may want to read http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Re: Ensuring an Avro File is NOT Splitable

2016-10-20 Thread Jörn Franke
What is the use case of this? You will reduce performance significantly. Nevertheless, the way you propose is the way to go, but I do not recommend it. > On 20 Oct 2016, at 14:00, Ashan Taha wrote: > > Hi > > What’s the best way to make sure an Avro file is NOT Splitable

Ensuring an Avro File is NOT Splitable

2016-10-20 Thread Ashan Taha
Hi What's the best way to make sure an Avro file is NOT splittable when read in Spark? Would you override AvroKeyInputFormat.isSplitable (to return false) and then call this using newAPIHadoopRDD? Or is there a better way using sqlContext.read? Thanks in advance
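A hedged sketch of that subclass-and-newAPIHadoopFile approach (the class name and path are made up):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.JobContext

    // Reads records exactly like AvroKeyInputFormat, but each file becomes one split.
    class NonSplittableAvroInputFormat[T] extends AvroKeyInputFormat[T] {
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    }

    val rdd = sc.newAPIHadoopFile(
      "/data/events.avro",
      classOf[NonSplittableAvroInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])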

Re: Microbatches length

2016-10-20 Thread Paulo Candido
In this case, do I have any alternative for getting microbatches of the same length? Using another class or some configuration? I'm using a socket. Thank you for your attention. On Thu, Oct 20, 2016 at 09:24, 王贺 (Gabriel) wrote: > The interval is for time, so you won't get

Re: Microbatches length

2016-10-20 Thread Gabriel
The interval is for time, so you won't get micro-batches of the same data size, only of the same time length. Yours sincerely, Gabriel (王贺) Mobile: +86 18621263813 On Thu, Oct 20, 2016 at 6:38 PM, pcandido wrote: > Hello folks, > > I'm using Spark Streaming. My question is simple: >

Expression Encoder for Map[Int, String] in a custom Aggregator on a Dataset

2016-10-20 Thread Anton Okolnychyi
Hi all, I am trying to use my custom Aggregator on a GroupedDataset of case classes to create a hash map using Spark SQL 1.6.2. My Encoder[Map[Int, String]] is not able to reconstruct the reduced values if I define it via ExpressionEncoder(). However, everything works fine if I define it as
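One workaround that is sometimes used here (not necessarily the one the author settled on) is to back the Map buffer with a Kryo-based encoder instead of ExpressionEncoder():

    import org.apache.spark.sql.{Encoder, Encoders}

    // Kryo serializes the whole Map as a single binary blob, sidestepping the
    // reconstruction problem at the cost of an opaque column.
    implicit val mapEncoder: Encoder[Map[Int, String]] = Encoders.kryo[Map[Int, String]]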

spark pi example fail on yarn

2016-10-20 Thread Li Li
I am setting up a small yarn/spark cluster. The hadoop/yarn version is 2.7.3 and I can run the wordcount map-reduce job correctly on yarn. I am using spark-2.0.1-bin-hadoop2.7 with the command: ~/spark-2.0.1-bin-hadoop2.7$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client

Microbatches length

2016-10-20 Thread pcandido
Hello folks, I'm using Spark Streaming. My question is simple: the documentation says that microbatches arrive at intervals, and the intervals are in real time (minutes, seconds). I want to get microbatches of the same length, so can I configure SS to return a microbatch when it reaches a determined

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-20 Thread Steve Loughran
> On 19 Oct 2016, at 21:46, Jakob Odersky wrote: > > Another reason I could imagine is that files are often read from HDFS, > which by default uses line terminators to separate records. > > It is possible to implement your own hdfs delimiter finder, however > for arbitrary
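In other words, every line has to be a complete JSON record on its own; a minimal illustration of the expected layout and the read call (the file name is made up):

    // people.json -- one self-contained JSON object per line, no enclosing array:
    //   {"name": "alice", "age": 29}
    //   {"name": "bob", "age": 31}

    val people = spark.read.json("people.json")   // sqlContext.read.json on 1.x
    people.printSchema()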

Spark Random Forest training cost same time on yarn as on standalone

2016-10-20 Thread 陈哲
I'm training a random forest model using Spark 2.0 on YARN with a command like: $SPARK_HOME/bin/spark-submit \ --class com.netease.risk.prediction.HelpMain --master yarn --deploy-mode client --driver-cores 1 --num-executors 32 --executor-cores 2 --driver-memory 10g --executor-memory 6g \ --conf

Re: Can i display message on console when use spark on yarn?

2016-10-20 Thread ayan guha
What exactly do you mean by YARN console? We use spark-submit and it generates exactly the same log as you mentioned on the driver console. On Thu, Oct 20, 2016 at 8:21 PM, Jone Zhang wrote: > I submit spark with "spark-submit --master yarn-cluster --deploy-mode > cluster" > How

Re: pyspark dataframe codes for lead lag to column

2016-10-20 Thread ayan guha
Yes, there are similar functions available; depending on your Spark version, look up the PySpark SQL functions module documentation. I also prefer to use SQL directly within pyspark. On Thu, Oct 20, 2016 at 8:18 PM, Mendelson, Assaf wrote: > Depending on your usecase, you may

Can i display message on console when use spark on yarn?

2016-10-20 Thread Jone Zhang
I submit Spark with "spark-submit --master yarn-cluster --deploy-mode cluster". How can I display messages on the YARN console? I expect it to be like this: . 16/10/20 17:12:53 main INFO org.apache.spark.deploy.yarn.Client>SPK> Application report for application_1453970859007_481440 (state:

RE: pyspark dataframe codes for lead lag to column

2016-10-20 Thread Mendelson, Assaf
Depending on your use case, you may want to take a look at window functions. From: muhammet pakyürek [mailto:mpa...@hotmail.com] Sent: Thursday, October 20, 2016 11:36 AM To: user@spark.apache.org Subject: pyspark dataframe codes for lead lag to column is there pyspark dataframe codes for lead
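The window functions being referred to look roughly like the following Scala sketch (pyspark.sql.functions and pyspark.sql.Window mirror it; df is a placeholder DataFrame, and the column name and the -1 default are illustrative, matching the example in the original question):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lag, lead}

    val w = Window.orderBy("value")
    val withLagLead = df
      .withColumn("lag",  lag("value", 1, -1).over(w))    // previous value, -1 at the start
      .withColumn("lead", lead("value", 1, -1).over(w))   // next value, -1 at the end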

Where condition on columns of Arrays does no longer work in spark 2

2016-10-20 Thread filthysocks
I have a Column in a DataFrame that contains Arrays and I want to filter for equality. It works fine in Spark 1.6 but not in 2.0. In Spark 1.6.2: import org.apache.spark.sql.SQLContext; case class DataTest(lists: Seq[Int]); val sql = new SQLContext(sc); val data = sql.createDataFrame(sc.parallelize(Seq(

Re: Dataframe schema...

2016-10-20 Thread Michael Armbrust
What is the issue you see when unioning? On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar wrote: > Hello Michael, > > Thank you for looking into this query. In my case there seems to be an > issue when I union a parquet file read from disk versus another dataframe > that I

pyspark dataframe codes for lead lag to column

2016-10-20 Thread muhammet pakyürek
Is there PySpark DataFrame code for lead/lag on a column? The lag/lead columns would look something like this:

value  lag  lead
1      -1   2
2      1    3
3      2    4
4      3    5
5      4    -1

How to iterate the element of an array in DataFrame?

2016-10-20 Thread Yan Facai
Hi, I want to extract the attribute `weight` from the elements of an array and combine them to construct a sparse vector. My data is like this: scala> mblog_tags.printSchema root |-- category.firstCategory: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- category:
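A hedged sketch of pulling a struct field out of such an array with explode (the literal dot in `category.firstCategory` needs backticks; any field names beyond `weight` are assumptions). The per-row (index, weight) pairs collected this way could then feed Vectors.sparse:

    import org.apache.spark.sql.functions.{col, explode}

    val weights = mblog_tags
      .select(explode(col("`category.firstCategory`")).as("tag"))  // one row per array element
      .select(col("tag.weight"))                                   // the struct attribute of interest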