Re: Flattening XML in a DataFrame

2016-08-16 Thread Hyukjin Kwon
Sorry for the late reply. Currently, the library only supports loading XML documents just as they are. Would you mind opening an issue with some more explanation here: https://github.com/databricks/spark-xml/issues? 2016-08-17 7:22 GMT+09:00 Sreekanth Jella : > Hi
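Until nested documents can be flattened by the library itself, here is a minimal Scala sketch of flattening the nested DataFrame that spark-xml produces, using dotted paths and explode; the element names (book, authors, publisher) are hypothetical, not from this thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().appName("flatten-xml").getOrCreate()
    import spark.implicits._

    // spark-xml keeps the nested structure of the document; "book",
    // "authors" and "publisher" below are placeholder element names.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")

    // Dotted paths pull fields out of structs; explode turns an array of
    // child elements into one row per element.
    val flat = df
      .withColumn("author", explode($"authors.author"))
      .select($"title", $"author", $"publisher.name".as("publisher"))

    flat.show()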

Re: JavaRDD to DataFrame fails with null pointer exception in 1.6.0

2016-08-16 Thread sudhir patil
Tested with Java 7 & 8; same issue on both versions. On Aug 17, 2016 12:29 PM, "spats" wrote: > Cannot convert JavaRDD to DataFrame in Spark 1.6.0; it throws a null pointer > exception & no more details. Can't really figure out what is really happening. > Any pointer to fixes?

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Alexey Svyatkovskiy
Hi Yanbo, Thanks for your reply. I will keep an eye on that pull request. For now, I decided to just put my code inside org.apache.spark.ml to be able to access private classes. Thanks, Alexey On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang wrote: > It seams that VectorUDT

JavaRDD to DataFrame fails with null pointer exception in 1.6.0

2016-08-16 Thread spats
Cannot convert JavaRDD to DataFrame in Spark 1.6.0; it throws a null pointer exception & no more details. Can't really figure out what is really happening. Any pointer to fixes? // convert JavaRDD to DataFrame DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class); // exception with no
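For comparison, a minimal Scala sketch of the same conversion (a stand-in, not a diagnosis); if this path works against the same data, it may be worth double-checking that the Java Person bean is a public, Serializable class with getters and setters for every field and no nested or array-typed fields, which is what the 1.6 JavaBean conversion expects:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical case class standing in for the Java Person bean.
    case class Person(name: String, age: Int)

    val sc = new SparkContext(new SparkConf().setAppName("rdd-to-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val peopleRDD = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
    val schemaPeople = sqlContext.createDataFrame(peopleRDD)
    schemaPeople.show()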

Re: Data frame Performance

2016-08-16 Thread Selvam Raman
Hi Mich, The input and output are just examples; those are not the exact column names. ColC is not needed. The code which I shared is working fine, but I need to confirm whether it is the right approach and how it affects performance. Thanks, Selvam R +91-97877-87724 On Aug 16, 2016 5:18 PM, "Mich Talebzadeh"

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread ayan guha
Hi, Thank you for your reply. Yes, I can get the prediction and original features together. My question is how to tie them back to other parts of the data which were not in the LabeledPoint. For example, I have a bunch of other dimensions which are not part of the features or label. Sorry if this is a stupid

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark currently. It should be public, but we need to do some refactoring before making it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 . Thanks Yanbo 2016-08-16 9:48 GMT-07:00 alexeys

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new columns to the existing DataFrame. That is, you can get both the prediction value and the original features from the output DataFrame of model.transform. Thanks Yanbo 2016-08-16 17:48 GMT-07:00 ayan guha : >
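A small sketch of what that looks like end to end, using the column names from the original question in this thread (the feature columns past_usage and temperature are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression

    val spark = SparkSession.builder().appName("predict-with-ids").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the real data; meter_number and date_read are
    // carried along even though they are not features.
    val raw = Seq(
      ("M1", "2016-08-01", 3.2, 2.9, 30.0),
      ("M2", "2016-08-01", 1.8, 2.0, 28.5)
    ).toDF("meter_number", "date_read", "amount", "past_usage", "temperature")

    val assembled = new VectorAssembler()
      .setInputCols(Array("past_usage", "temperature"))
      .setOutputCol("features")
      .transform(raw)

    val model = new LinearRegression()
      .setLabelCol("amount")
      .setFeaturesCol("features")
      .fit(assembled)

    // transform only appends columns, so the id/dimension columns survive and
    // can be selected right next to the prediction.
    model.transform(assembled)
      .select("meter_number", "date_read", "amount", "prediction")
      .show()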

Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Michael Armbrust
Try running explain on each of these. My guess would be that caching is broken in some cases. On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski wrote: > Hi, > > Can anyone explain why spark.read.csv("people.csv").cache.show ends up > with a WARN while
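For reference, a sketch of the comparison being suggested, run in spark-shell against the same file; comparing the extended plans may show where the two reads diverge:

    val csvDF  = spark.read.csv("people.csv").cache()
    val textDF = spark.read.text("people.csv").cache()

    // extended = true prints the logical and physical plans.
    csvDF.explain(true)
    textDF.explain(true)

    csvDF.show()
    textDF.show()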

create SparkSession without loading defaults for unit tests

2016-08-16 Thread Koert Kuipers
For unit tests I would like to create a SparkSession that does not load anything from system properties, similar to: new SQLContext(new SparkContext(new SparkConf(loadDefaults = false))). How do I go about doing this? I don't see a way. Thanks! koert
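One possible workaround (a sketch that assumes SparkSession.builder().getOrCreate() will pick up an already-running SparkContext rather than building a new one from defaults): create the context yourself from a SparkConf with loadDefaults = false, then build the session on top of it.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SparkSession

    // A conf that ignores system properties entirely.
    val conf = new SparkConf(loadDefaults = false)
      .setMaster("local[2]")
      .setAppName("unit-test")

    val sc = new SparkContext(conf)

    // Assumption: getOrCreate reuses the active SparkContext created above
    // instead of constructing a new one from default properties.
    val spark = SparkSession.builder().getOrCreate()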

Re: GraphFrames 0.2.0 released

2016-08-16 Thread Jacek Laskowski
Hi Tim, AWESOME. Thanks a lot for releasing it. That makes me even more eager to see it in Spark's codebase (and replacing the current RDD-based API)! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Cody Koeninger
The underlying kafka consumer On Tue, Aug 16, 2016 at 2:17 PM, Srikanth wrote: > Yes, SubscribePattern detects new partition. Also, it has a comment saying > >> Subscribe to all topics matching specified pattern to get dynamically >> assigned partitions. >> * The pattern

[SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Jacek Laskowski
Hi, Can anyone explain why spark.read.csv("people.csv").cache.show ends up with a WARN while spark.read.text("people.csv").cache.show does not? It happens in 2.0 and today's build. scala> sc.version res5: String = 2.1.0-SNAPSHOT scala> spark.read.csv("people.csv").cache.show

SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread ayan guha
Hi, I have a dataset as follows: DF: amount:float date_read:date meter_number:string I am trying to predict future amount based on the past 3 weeks' consumption (and a heap of weather data related to the date). My LabeledPoint looks like label (populated from DF.amount) features (populated from a bunch

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Chanh Le
Hi Michael, You should use Alluxio instead. http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html It should be easier. Regards, Chanh > On Aug 17, 2016, at 5:45 AM, Michael Allman

Can't connect to remote spark standalone cluster: getting WARN TaskSchedulerImpl: Initial job has not accepted any resources

2016-08-16 Thread Andrew Vykhodtsev
Dear all, I am trying to connect a remote windows machine to a standalone spark cluster (a single VM running on Ubuntu server with 8 cores and 64GB RAM). Both client and server have Spark 2.0 software prebuilt for Hadoop 2.6, and hadoop 2.7 I have the following settings on cluster: export
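Hard to say more without the full settings, but here is a sketch of the knobs that tend to matter for a remote driver talking to a standalone master; every value below is a placeholder, not a recommendation:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setMaster("spark://ubuntu-server:7077")
      .setAppName("remote-client-test")
      .set("spark.executor.memory", "4g")      // must fit within what the worker advertises
      .set("spark.cores.max", "4")             // leave cores free if other apps hold them
      .set("spark.driver.host", "10.0.0.5")    // an address of the Windows client reachable FROM the cluster
      .set("spark.driver.port", "51000")       // fixed ports make firewall rules possible
      .set("spark.blockManager.port", "51001")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.range(1000).count()                  // tiny job just to confirm executors register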

Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Michael Allman
Hello, A coworker was having a problem with a big Spark job failing after several hours when one of the executors would segfault. That problem aside, I speculated that her job would be more robust against these kinds of executor crashes if she used replicated RDD storage. She's using off heap
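For anyone following along, a sketch of the setup in question (replicated, off-heap persistence) and the settings it depends on; this only reproduces the configuration, it is not a fix:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Off-heap storage needs an explicit off-heap memory budget.
    val spark = SparkSession.builder()
      .appName("offheap-replicated")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()

    // useDisk, useMemory, useOffHeap, deserialized, replication
    val offHeap2x = StorageLevel(false, true, true, false, 2)

    val rdd = spark.sparkContext.parallelize(1 to 1000000)
    rdd.persist(offHeap2x)
    rdd.count()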

RE: Flattening XML in a DataFrame

2016-08-16 Thread Sreekanth Jella
Hi Experts, Please suggest. Thanks in advance. Thanks, Sreekanth From: Sreekanth Jella [mailto:srikanth.je...@gmail.com] Sent: Sunday, August 14, 2016 11:46 AM To: 'Hyukjin Kwon' Cc: 'user @spark' Subject: Re: Flattening XML in a

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Aris
Hello Michael, I made a JIRA with sample code to reproduce this problem. I set you as the "shepherd" -- I hope this is enough; otherwise I can fix it. https://issues.apache.org/jira/browse/SPARK-17092 On Sun, Aug 14, 2016 at 9:38 AM, Michael Armbrust wrote: > Anytime

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Aris
My error is specifically: Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class "org.apache.spark.sql.catalyst.ex > pressions.GeneratedClass$SpecificOrdering" grows beyond 64

Re: Sum array values by row in new column

2016-08-16 Thread Javier Rey
Hi, Thanks!! This works, but I also need the mean :) I am looking for a way. Regards. 2016-08-16 5:30 GMT-05:00 ayan guha : > Here is a more generic way of doing this: > > from pyspark.sql import Row > df = sc.parallelize([[1,2,3,4],[10,20,30]]).map(lambda x: >
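A Scala sketch of the same idea extended to the mean, assuming the same kind of array column ("numbers") as in the quoted Python example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("array-stats").getOrCreate()
    import spark.implicits._

    val df = Seq(Tuple1(Seq(1, 2, 3, 4)), Tuple1(Seq(10, 20, 30))).toDF("numbers")

    // One UDF for the per-row sum, one for the per-row mean.
    val sumUdf  = udf((xs: Seq[Int]) => xs.sum)
    val meanUdf = udf((xs: Seq[Int]) => xs.sum.toDouble / xs.size)

    df.withColumn("sum", sumUdf($"numbers"))
      .withColumn("mean", meanUdf($"numbers"))
      .show()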

Re: Rebalancing when adding kafka partitions

2016-08-16 Thread Srikanth
Yes, SubscribePattern detects new partition. Also, it has a comment saying Subscribe to all topics matching specified pattern to get dynamically > assigned partitions. > * The pattern matching will be done periodically against topics existing > at the time of check. > * @param pattern pattern
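For reference, a sketch of using SubscribePattern with the 0-10 direct stream (run in spark-shell where sc exists; the topic pattern and Kafka parameters are placeholders):

    import java.util.regex.Pattern
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val ssc = new StreamingContext(sc, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest"
    )

    // Per the doc comment quoted above, the pattern is re-checked periodically,
    // so topics/partitions that start matching later get assigned over time.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](
        Pattern.compile("events.*"), kafkaParams))

    stream.map(r => (r.topic, r.partition, r.value)).print()
    ssc.start()
    ssc.awaitTermination()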

Large where clause StackOverflow 1.5.2

2016-08-16 Thread rachmaninovquartet
Hi, I'm trying to implement a folding function in Spark. It takes an input k and a data frame of ids and dates. k=1 will be just the data frame; k=2 will consist of the min and max date for each id once and the rest twice; k=3 will consist of min and max once, min+1 and max-1 twice, and the rest

Re: Spark Executor Metrics

2016-08-16 Thread Otis Gospodnetić
Hi Muhammad, You should give people a bit more time to answer/help you (for free). :) I don't have a direct answer for you, but you can look at SPM for Spark, which has all the instructions for getting all Spark metrics (Executors,

Re: DataFrame use case

2016-08-16 Thread Sean Owen
I'd say that Datasets, not DataFrames, are the natural evolution of RDDs. DataFrames are for inherently tabular data, and most naturally manipulated by SQL-like operations. Datasets operate on programming language objects like RDDs. So, RDDs to DataFrames isn't quite apples-to-apples to begin
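A small sketch of that distinction, assuming a trivial Person type:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("ds-vs-df").getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

    // Dataset: typed, object-oriented operations, much like an RDD.
    val adultNames = ds.filter(_.age >= 18).map(_.name)

    // DataFrame (= Dataset[Row]): tabular, column/SQL-style operations.
    val df = ds.toDF()
    df.select($"name").where($"age" >= 18).show()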

DataFrame use case

2016-08-16 Thread jtgenesis
Hey guys, I've been digging around trying to figure out if I should transition from RDDs to DataFrames. I'm currently using RDDs to represent tiles of binary imagery data and I'm wondering if representing the data as a DataFrame is a better solution. To get my feet wet, I did a little comparison

VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread alexeys
I am writing a UDAF to be applied to a data frame column of type Vector (spark.ml.linalg.Vector). I rely on spark.ml.linalg so that I do not have to go back and forth between DataFrame and RDD. Inside the UDAF, I have to specify a data type for the input, buffer, and output (as usual).
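A hedged skeleton of what such a UDAF can look like for an element-wise vector sum; because ml's VectorUDT is still private in 2.0 (see the replies in this thread), the sketch assumes the workaround of compiling the class under a sub-package of org.apache.spark.ml, which is unsupported and may break between versions:

    // Placed under org.apache.spark.ml only to reach the private VectorUDT;
    // this mirrors the workaround discussed in the thread, not a public API.
    package org.apache.spark.ml.myudafs

    import org.apache.spark.ml.linalg.{Vector, VectorUDT}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class VectorSum(n: Int) extends UserDefinedAggregateFunction {
      // Input is a single ml Vector column.
      override def inputSchema: StructType =
        StructType(StructField("v", new VectorUDT()) :: Nil)

      // The buffer keeps a running element-wise sum as a plain array.
      override def bufferSchema: StructType =
        StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)

      override def dataType: DataType = ArrayType(DoubleType)
      override def deterministic: Boolean = true

      override def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Array.fill(n)(0.0)

      override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val sum = buffer.getSeq[Double](0).toArray
        input.getAs[Vector](0).foreachActive((i, x) => sum(i) += x)
        buffer(0) = sum
      }

      override def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
        val sum = b1.getSeq[Double](0).toArray
        b2.getSeq[Double](0).zipWithIndex.foreach { case (x, i) => sum(i) += x }
        b1(0) = sum
      }

      override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
    }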

GraphFrames 0.2.0 released

2016-08-16 Thread Tim Hunter
Hello all, I have released version 0.2.0 of the GraphFrames package. Apart from a few bug fixes, it is the first release published for Spark 2.0 and both scala 2.10 and 2.11. Please let us know if you have any comment or questions. It is available as a Spark package:
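For anyone who wants to try it from spark-shell, a short sketch; the package coordinate is my best guess at the 0.2.0 artifact for Spark 2.0 / Scala 2.11, so please verify it on the package page:

    // Launched with something like:
    //   spark-shell --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
    import org.graphframes.GraphFrame
    import spark.implicits._

    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val edges = Seq(("a", "b", "follows"), ("b", "c", "follows")).toDF("src", "dst", "relationship")

    val g = GraphFrame(vertices, edges)
    g.inDegrees.show()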

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
Can you take a look at commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf ? There was a test: SPARK-15285 Generated SpecificSafeProjection.apply method grows beyond 64KB See if it matches your use case. On Tue, Aug 16, 2016 at 8:41 AM, Aris wrote: > I am still working on

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Aris
I am still working on making a minimal test that I can share without my work-specific code being in there. However, the problem occurs with a DataFrame with several hundred columns being asked to do a randomSplit. randomSplit works with up to about 350 columns so far. It breaks in my code with

Re: Spark Executor Metrics

2016-08-16 Thread Muhammad Haris
Still waiting for response, any clue/suggestions? On Tue, Aug 16, 2016 at 4:48 PM, Muhammad Haris < muhammad.haris.makh...@gmail.com> wrote: > Hi, > I have been trying to collect driver, master, worker and executors metrics > using Spark 2.0 in standalone mode, here is what my metrics

Re: submitting spark job with kerberized Hadoop issue

2016-08-16 Thread Aneela Saleem
Thanks Steve, I went through this but still not able to fix the issue On Mon, Aug 15, 2016 at 2:01 AM, Steve Loughran wrote: > Hi, > > Just came across this while going through all emails I'd left unread over > my vacation. > > did you manage to fix this? > > 1. There's

Re: long lineage

2016-08-16 Thread Ted Yu
Have you tried periodic checkpoints? Cheers > On Aug 16, 2016, at 5:50 AM, pseudo oduesp wrote: > > Hi, > how can we deal with a StackOverflowError triggered by a long lineage? > I mean, I have this error; how do I resolve it without creating a new session? > Thanks >
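A minimal sketch of what periodic checkpointing looks like for an iterative job (the interval and paths are arbitrary):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cut-lineage").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // any reliable storage

    var rdd = sc.parallelize(1 to 1000000).map(_.toLong)

    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)
      // Every few iterations, cut the lineage: checkpoint materializes the RDD
      // and drops its parent references, which avoids the deep recursion that
      // can otherwise end in a StackOverflowError.
      if (i % 10 == 0) {
        rdd.cache()
        rdd.checkpoint()
        rdd.count()      // force the checkpoint to actually run
      }
    }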

long lineage

2016-08-16 Thread pseudo oduesp
Hi, how can we deal with a StackOverflowError triggered by a long lineage? I mean, I have this error; how do I resolve it without creating a new session? Thanks

Spark Executor Metrics

2016-08-16 Thread Muhammad Haris
Hi, I have been trying to collect driver, master, worker and executors metrics using Spark 2.0 in standalone mode, here is what my metrics configuration file looks like: *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink *.sink.csv.period=1 *.sink.csv.unit=seconds
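For comparison, a fuller CsvSink configuration, assembled from the standard metrics.properties template that ships with Spark (the directory is arbitrary); the file also has to be visible to every JVM being measured, for example placed in each node's conf directory or pointed to with spark.metrics.conf:

    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    *.sink.csv.period=1
    *.sink.csv.unit=seconds
    *.sink.csv.directory=/tmp/spark-metrics

    # Optionally expose JVM metrics per component as well.
    master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource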

Re: Data frame Performance

2016-08-16 Thread Mich Talebzadeh
Hi Selvam, is the table called sel? And are these assumptions correct: site -> ColA, requests -> ColB? I don't think you are using ColC here? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: class not found exception Logging while running JavaKMeansExample

2016-08-16 Thread Ted Yu
The class is: core/src/main/scala/org/apache/spark/internal/Logging.scala So it is in spark-core. On Tue, Aug 16, 2016 at 2:33 AM, subash basnet wrote: > Hello Yuzhihong, > > I didn't get how to implement what you said in the JavaKMeansExample.java. > As I get the logging
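Since org.apache.spark.internal.Logging only exists from Spark 2.0 onward (in 1.x the trait lived at org.apache.spark.Logging), one likely cause is mixing 1.x and 2.x artifacts on the classpath. A hedged build.sbt sketch with every Spark module pinned to the same version:

    // build.sbt -- keep all Spark modules on one version so classes such as
    // org.apache.spark.internal.Logging resolve consistently at runtime.
    scalaVersion := "2.11.8"

    val sparkVersion = "2.0.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % sparkVersion,
      "org.apache.spark" %% "spark-sql"   % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion
    )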

Unsubscribe

2016-08-16 Thread Martin Serrano
Sent from my Verizon Wireless 4G LTE DROID

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
I think we should reopen it. > On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki wrote: > > I just realized it since it broke a build with Scala 2.10. > https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf > > I can reproduce the problem in

Data frame Performance

2016-08-16 Thread Selvam Raman
Hi All, Please suggest the best approach to achieve this result. [Please comment on whether the existing logic is fine or not.] Input records (ColA, ColB, ColC): (1, 2, 56), (1, 2, 46), (1, 3, 45), (1, 5, 34), (1, 5, 90), (2, 1, 89), (2, 5, 45). Expected result (ResA, ResB): 1 -> 2:2|3:3|5:5, 2 -> 1:1|5:5. I followed the below
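One way to express this with DataFrame aggregations (a sketch; the "b:b" pairing is copied literally from the sample output above, so the real pairing logic may well differ, and the ordering inside the joined string is not guaranteed):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{collect_set, concat, concat_ws, lit}

    val spark = SparkSession.builder().appName("group-concat").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, 2, 56), (1, 2, 46), (1, 3, 45), (1, 5, 34),
      (1, 5, 90), (2, 1, 89), (2, 5, 45)
    ).toDF("ColA", "ColB", "ColC")

    // Group by ColA and fold the distinct ColB values into one delimited string.
    val result = df
      .groupBy("ColA")
      .agg(concat_ws("|",
        collect_set(concat($"ColB".cast("string"), lit(":"), $"ColB".cast("string")))).as("ResB"))
      .select($"ColA".as("ResA"), $"ResB")

    result.show(false)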

Re: Sum array values by row in new column

2016-08-16 Thread ayan guha
Here is a more generic way of doing this: from pyspark.sql import Row df = sc.parallelize([[1,2,3,4],[10,20,30]]).map(lambda x: Row(numbers=x)).toDF() df.show() from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType u = udf(lambda c: sum(c), IntegerType()) df1 =

Re: Accessing HBase through Spark with Security enabled

2016-08-16 Thread Aneela Saleem
Thanks Steve, I have gone through its documentation, but I did not get any idea of how to install it. Can you help me? On Mon, Aug 15, 2016 at 4:23 PM, Steve Loughran wrote: > > On 15 Aug 2016, at 08:29, Aneela Saleem wrote: > > Thanks Jacek! > > I

Re: class not found exception Logging while running JavaKMeansExample

2016-08-16 Thread subash basnet
Hello Yuzhihong, I didn't understand how to implement what you said in JavaKMeansExample.java. I get the Logging exception while creating the SparkSession: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging at

Re: Linear regression, weights constraint

2016-08-16 Thread Taras Lehinevych
Thank you a lot for the answer. Have a nice day. On Tue, Aug 16, 2016 at 10:55 AM Yanbo Liang wrote: > Spark MLlib does not support boxed constraints on model coefficients > currently. > > Thanks > Yanbo > > 2016-08-15 3:53 GMT-07:00 letaiv : > >> Hi

Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR runs? Does your program output the same model between different attempts? Thanks Yanbo 2016-08-12 3:08 GMT-07:00 olivierjeunen : > I'm using pyspark ML's logistic regression implementation to do some >

Re: java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread Sumit Khanna
This is just the stack trace, but where is it you are calling the UDF? Regards, Sumit On 16-Aug-2016 2:20 pm, "pseudo oduesp" wrote: > hi, > I create new columns with a UDF; after that, I try to filter on these columns > and I get this error. Why? > > :

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not support boxed constraints on model coefficients currently. Thanks Yanbo 2016-08-15 3:53 GMT-07:00 letaiv : > Hi all, > > Is there any approach to add constrain for weights in linear regression? > What I need is least squares regression with

java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread pseudo oduesp
Hi, I create new columns with a UDF; after that, I try to filter on these columns and I get this error. Why? : java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true]) at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:221) at
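Hard to diagnose without seeing where the UDF is called (as asked in the reply above), but for reference a minimal sketch of the pattern described, creating a column with a UDF and then filtering on it; the names (raw, cleaned, funNm) are hypothetical stand-ins for fun_nm and its column:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("udf-then-filter").getOrCreate()
    import spark.implicits._

    val df = Seq("alpha", "beta", "").toDF("raw")

    // Hypothetical UDF standing in for fun_nm from the stack trace.
    val funNm = udf((s: String) => if (s == null || s.isEmpty) "empty" else s.toUpperCase)

    val withCol = df.withColumn("cleaned", funNm($"raw"))

    // Filter on the materialized column.
    withCol.filter($"cleaned" =!= "empty").show()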

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Kazuaki Ishizaki
I just realized it since it broke a build with Scala 2.10. https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf I can reproduce the problem in SPARK-15285 with the master branch. Should we reopen SPARK-15285? Best Regards, Kazuaki Ishizaki, From: Ted Yu

MLIB and R results do not match for SVD

2016-08-16 Thread roni
Hi All, Some time back I asked a question about PCA results not matching between R and MLlib. It was suggested that I use svd.V instead of PCA to match the uncentered PCA. But the results of MLlib and R for SVD do not match (I can understand the numbers not matching exactly), but the
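When comparing, it may help that singular vectors are only determined up to sign (and columns can swap order when singular values are close), so comparing abs(V) column by column against abs of R's svd(x)$v is usually the fairer test. A sketch of the MLlib side, run in spark-shell with placeholder data:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Placeholder rows standing in for the real matrix compared against R.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)
    ))

    val mat = new RowMatrix(rows)
    val svd = mat.computeSVD(3, computeU = true)

    // s: singular values (local vector); V: right singular vectors (local matrix).
    println(svd.s)
    println(svd.V)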