SparkML RandomForest

2016-08-10 Thread Pengcheng
Hi There, I was comparing RandomForest in spark.ml (org.apache.spark.ml.classification) and spark.mllib (org.apache.spark.mllib.tree) using the same datasets and the same parameter settings; spark.mllib always gives me better results on the test data sets. I was wondering 1. Did anyone notice similar

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-10 Thread Holden Karau
Hi Luis, You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you can do groupBy followed by a reduce on the GroupedDataset ( http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset ) - this works on a per-key basis despite the different name.
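A minimal sketch (not from the thread) of the groupBy-then-reduce pattern Holden describes, assuming Spark 1.6.2 and a hypothetical Record case class; the reduce on the GroupedDataset runs per key, much like RDD.reduceByKey:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(col1: String, col2: Int)

    val sc = new SparkContext(new SparkConf().setAppName("reduce-per-key").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val ds = Seq(Record("a", 1), Record("a", 5), Record("b", 3)).toDS()
    // keep the record with the smallest col2 for each col1
    val minPerKey = ds.groupBy(_.col1).reduce((x, y) => if (x.col2 <= y.col2) x else y)
    minPerKey.show()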

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
Thanks, pair_rdd.rdd.groupByKey() did the trick. On Wed, Aug 10, 2016 at 8:24 PM, Holden Karau wrote: > So it looks like (despite the name) pair_rdd is actually a Dataset - my > guess is you might have a map on a dataset up above which used to return an > RDD but now

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Holden Karau
So it looks like (despite the name) pair_rdd is actually a Dataset - my guess is you might have a map on a dataset up above which used to return an RDD but now returns another dataset or an unexpected implicit conversion. Just add rdd() before the groupByKey call to push it into an RDD. That being
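A minimal sketch of the workaround being discussed; MyKey/MyData stand in for the thread's own types, and pair_rdd is assumed to actually be a Dataset[(MyKey, MyData)] produced by an earlier map over a Dataset:

    case class MyKey(k: String)
    case class MyData(v: Int)

    val some_rdd = pair_rdd.rdd          // drop back to an RDD[(MyKey, MyData)]
      .groupByKey()                      // RDD.groupByKey, the pre-2.0 behaviour
      .flatMap { case (mk: MyKey, mdIter: Iterable[MyData]) =>
        mdIter.map(md => (mk.k, md.v))   // ... per-key processing as before ...
      }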

groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
Here is the offending line: val some_rdd = pair_rdd.groupByKey().flatMap { case (mk: MyKey, md_iter: Iterable[MyData]) => { ... [error] .scala:249: overloaded method value groupByKey with alternatives: [error] [K](func: org.apache.spark.api.java.function.MapFunction[(aaa.MyKey,

na.fill doesn't work

2016-08-10 Thread Javier Rey
Hi everybody, I have a data frame after many transformations; my final task is to fill na's with zeros, but when I run this command: df_fil1 = df_fil.na.fill(0), it doesn't work: the nulls don't disappear. I did a toy test and it works correctly. I don't understand what happened. Thanks in
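A hedged guess at the usual culprit, sketched for a hypothetical df_fil: na.fill(0) only replaces null/NaN in numeric columns and returns a new DataFrame, so string columns keep their nulls and need a string default:

    val filledNumeric = df_fil.na.fill(0)                          // numeric nulls -> 0
    val filledAll     = filledNumeric.na.fill("", Seq("some_str")) // string columns need a string default
    filledAll.filter(filledAll("some_str").isNull).count()         // should now be 0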

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-10 Thread Cheng Lian
Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue: https://github.com/apache/spark/pull/14585/files Cheng On 8/9/16 5:38 PM, immerrr again wrote:

Is there a reduceByKey functionality in DataFrame API?

2016-08-10 Thread luismattor
Hi everyone, Consider the following code: val result = df.groupBy("col1").agg(min("col2")) I know that rdd.reduceByKey(func) produces the same RDD as rdd.groupByKey().mapValues(value => value.reduce(func)) However reduceByKey is more efficient as it avoids shipping each value to the reducer

Re: Spark2 SBT Assembly

2016-08-10 Thread Efe Selcuk
Thanks for the replies, folks. My specific use case is maybe unusual. I'm working in the context of the build environment in my company. Spark was being used in such a way that the fat assembly jar that the old 'sbt assembly' command outputs was used when building a spark application. I'm trying

Re: Spark submit job that points to URL of a jar

2016-08-10 Thread Mich Talebzadeh
you can build your uber jar file on an NFS-mounted file system accessible to all nodes in the cluster. Any node can then spark-submit and run the app, referring to the jar file. Sounds doable. Having thought about it, it is feasible to place the Spark binaries on the NFS mount as well so any host can

Spark submit job that points to URL of a jar

2016-08-10 Thread Zlati Gardev
Hello, Is there a way to run a spark submit job that points to the URL of a jar file (instead of pushing the jar from local)? The documentation at http://spark.apache.org/docs/latest/submitting-applications.html implies that this may be possible. "application-jar: Path to a bundled jar

Re: Spark2 SBT Assembly

2016-08-10 Thread Mich Talebzadeh
Hi Efe, Are you talking about creating an uber/fat jar file for your specific application? Then you can distribute it to another node just to use the jar file without assembling it. I can still do it in Spark 2 as before if I understand your special use case. [warn] Strategy 'discard' was

Re: Spark2 SBT Assembly

2016-08-10 Thread Marco Mistroni
How about all the dependencies? Presumably they will all go in --jars? What if I have 10 dependencies? Any best practices in packaging apps for Spark 2.0? Kr On 10 Aug 2016 6:46 pm, "Nick Pentreath" wrote: > You're correct - Spark packaging has been shifted to not use the
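Not from the thread, but one common pattern for packaging an application (as opposed to building Spark itself) is the sbt-assembly plugin with Spark marked as "provided", so the fat jar carries only your own code and third-party dependencies and spark-submit needs no --jars list. A hedged sketch; the project name, versions, and example dependency are assumptions:

    // project/plugins.sbt (plugin version is an assumption):
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // build.sbt
    name := "my-spark-app"          // hypothetical project name
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided",
      // non-Spark dependencies go here and end up inside the assembly
      "com.typesafe"      % "config"     % "1.3.0"
    )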

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Sean, I have created a jira; I hope you don't mind that I borrowed your explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001 So what did you do to standardize your data, if you didn't use standardScaler? Did you write a udf to subtract mean and divide by standard deviation?

Re: Spark2 SBT Assembly

2016-08-10 Thread Holden Karau
What are you looking to use the assembly jar for - maybe we can think of a workaround :) On Wednesday, August 10, 2016, Efe Selcuk wrote: > Sorry, I should have specified that I'm specifically looking for that fat > assembly behavior. Is it no longer possible? > > On Wed,

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

2016-08-10 Thread شجاع الرحمن بیگ
Hi, I am getting following error while processing large input size. ... [Stage 18:> (90 + 24) / 240]16/08/10 19:39:54 WARN TaskSetManager: Lost task 86.1 in stage 18.0 (TID 2517, bscpower8n2-data): FetchFailed(null, shuffleId=0, mapId=-1,

UNSUBSCRIBE

2016-08-10 Thread Sheth, Niraj

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
Checked executor logs and UI. There is no error message or anything like that. When there is any action, it is waiting. There is data in the partitions; I could use simple-consumer-shell and print all the data in the console. Am I doing anything wrong in foreachRDD? This just works fine with a single

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Ah right, got it. As you say for storage it helps significantly, but for operations I suspect it puts one back in a "dense-like" position. Still, for online / mini-batch algorithms it may still be feasible I guess. On Wed, 10 Aug 2016 at 19:50, Sean Owen wrote: > All

Re: Spark2 SBT Assembly

2016-08-10 Thread Nick Pentreath
You're correct - Spark packaging has been shifted to not use the assembly jar. To build now use "build/sbt package" On Wed, 10 Aug 2016 at 19:40, Efe Selcuk wrote: > Hi Spark folks, > > With Spark 1.6 the 'assembly' target for sbt would build a fat jar with > all of the

Re: Spark2 SBT Assembly

2016-08-10 Thread Efe Selcuk
Sorry, I should have specified that I'm specifically looking for that fat assembly behavior. Is it no longer possible? On Wed, Aug 10, 2016 at 10:46 AM, Nick Pentreath wrote: > You're correct - Spark packaging has been shifted to not use the assembly > jar. > > To

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually represents 0 3 0 7. Imagine it also has an offset stored which applies to all elements. If it is -2 then it now represents -2 1 -2 5, but this requires just one extra value to store. It only helps with storage of a shifted
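A toy illustration of that offset idea (not an existing Spark API; the class below is purely hypothetical): one extra stored value shifts every element at read time, so the shifted vector stays sparse.

    case class OffsetSparseVector(size: Int,
                                  indices: Array[Int],
                                  values: Array[Double],
                                  offset: Double) {
      def apply(i: Int): Double = {
        val j = java.util.Arrays.binarySearch(indices, i)
        (if (j >= 0) values(j) else 0.0) + offset
      }
    }

    // 1:3 3:7 with offset -2 represents (-2, 1, -2, 5) while storing only 5 numbers.
    val v = OffsetSparseVector(4, Array(1, 3), Array(3.0, 7.0), offset = -2.0)
    assert((0 until 4).map(v(_)) == Seq(-2.0, 1.0, -2.0, 5.0))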

Spark2 SBT Assembly

2016-08-10 Thread Efe Selcuk
Hi Spark folks, With Spark 1.6 the 'assembly' target for sbt would build a fat jar with all of the main Spark dependencies for building an application. Against Spark 2, that target is no longer building a spark assembly, just ones for e.g. Flume and Kafka. I'm not well versed with maven and sbt,

Simulate serialization when running local

2016-08-10 Thread Ashic Mahtab
Hi, Is there a way to simulate "networked" spark when running local (i.e. master=local[4])? Ideally, some setting that'll ensure any "Task not serializable" errors are caught during local testing? I seem to vaguely remember something, but am having trouble pinpointing it. Cheers, Ashic.
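Not from the thread, but one hedged option: the test-oriented (and undocumented) "local-cluster" master URL starts separate worker JVMs on a single machine, so tasks, results and shuffle data genuinely cross process boundaries. It needs a built Spark distribution (SPARK_HOME) to launch the workers, and behaviour may still differ from a real cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("serialization-check")
      .setMaster("local-cluster[2, 1, 1024]")   // 2 workers, 1 core and 1024 MB each
    val sc = new SparkContext(conf)
    try {
      // runs through real (de)serialization between driver and worker JVMs
      sc.parallelize(1 to 100).map(_ * 2).reduce(_ + _)
    } finally {
      sc.stop()
    }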

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Sean by 'offset' do you mean basically subtracting the mean but only from the non-zero elements in each row? On Wed, 10 Aug 2016 at 19:02, Sean Owen wrote: > Yeah I had thought the same, that perhaps it's fine to let the > StandardScaler proceed, if it's explicitly asked to

Re: Running spark Java on yarn cluster

2016-08-10 Thread atulp
Thanks Mandar. Our need is to get SQL queries from clients and submit them over the Spark cluster. We don't want an application to get submitted for each query. We want executors to be shared across multiple queries, as we would cache RDDs that would get used across queries. If I am correct, spark context

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Yeah I had thought the same, that perhaps it's fine to let the StandardScaler proceed, if it's explicitly asked to center, rather than refuse to. It's not really much more rope to let a user hang herself with, and it blocks legitimate usages (we ran into this last week and couldn't use

Re: Changing Spark configuration midway through application.

2016-08-10 Thread Andrew Ehrlich
If you're changing properties for the SparkContext, then I believe you will have to start a new SparkContext with the new properties. On Wed, Aug 10, 2016 at 8:47 AM, Jestin Ma wrote: > If I run an application, for example with 3 joins: > > [join 1] > [join 2] > [join
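A minimal sketch of the approach Andrew describes (the property name is just an example): SparkConf settings are fixed for the life of a SparkContext, so changing them between stages means stopping one context and starting another, and any intermediate results must be written out because cached RDDs do not survive the restart.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc1 = new SparkContext(
      new SparkConf().setAppName("join-1").set("spark.default.parallelism", "200"))
    // ... join 1, write intermediate results to storage ...
    sc1.stop()

    val sc2 = new SparkContext(
      new SparkConf().setAppName("join-2").set("spark.default.parallelism", "400"))
    // ... join 2 reads the intermediate results back ...
    sc2.stop()

Some SQL-level settings (for example spark.sql.shuffle.partitions) can instead be changed at runtime via sqlContext.setConf, which avoids restarting the context.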

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Thanks Sean, I agree 100% that the math is math and dense vs sparse is just a matter of representation. I was trying to convince a co-worker of this to no avail. Sending this email was mainly a sanity check. I think having an offset would be a great idea, although I am not sure how to

Re: unsubscribe

2016-08-10 Thread Matei Zaharia
To unsubscribe, please send an email to user-unsubscr...@spark.apache.org from the address you're subscribed from. Matei > On Aug 10, 2016, at 12:48 PM, Sohil Jain wrote: > > - To unsubscribe

unsubscribe

2016-08-10 Thread Sohil Jain

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Dense vs sparse is just a question of representation, so doesn't make an operation on a vector more or less important as a result. You've identified the reason that subtracting the mean can be undesirable: a notionally billion-element sparse vector becomes too big to fit in memory at once. I know

Changing Spark configuration midway through application.

2016-08-10 Thread Jestin Ma
If I run an application, for example with 3 joins: [join 1] [join 2] [join 3] [final join and save to disk] Could I change Spark properties in between each join? [join 1] [change properties] [join 2] [change properties] ... Or would I have to create a separate application with different

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-10 Thread Aseem Bansal
To those interested: I changed the data frame to an RDD, then created a data frame from it, which has an option of giving a schema. But probably someone should improve how to use the as function. On Mon, Aug 8, 2016 at 1:05 PM, Ewan Leith wrote: > Hmm I’m not sure, I don’t
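A rough sketch of the workaround described above (column names and types are assumptions, `df` is the existing DataFrame and `spark` the 2.0 SparkSession): go through an RDD[Row] and rebuild the DataFrame with an explicit schema instead of relying on .as[...]:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id",   LongType,   nullable = false),
      StructField("name", StringType, nullable = true)
    ))

    val rowRdd = df.rdd.map(r => Row(r.getLong(0), r.getString(1)))
    val typed  = spark.createDataFrame(rowRdd, schema)
    typed.printSchema()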

Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Hi everyone, I am doing some standardization using standardScaler on data from VectorAssembler which is represented as sparse vectors. I plan to fit a regularized model. However, standardScaler does not allow the mean to be subtracted from sparse vectors. It will only divide by the standard
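A minimal sketch of the usual compromise while centering of sparse vectors is unsupported: scale to unit standard deviation only, which keeps the VectorAssembler output sparse (the "features" column name and `assembledDf` are assumptions):

    import org.apache.spark.ml.feature.StandardScaler

    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(false)   // subtracting the mean would densify the vectors
      .setWithStd(true)

    val scaledDf = scaler.fit(assembledDf).transform(assembledDf)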

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-10 Thread cdecleene
Using the scala api instead of the python api yields the same results. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502p27506.html Sent from the Apache Spark User List mailing

Re: Spark SQL Parallelism - While reading from Oracle

2016-08-10 Thread @Sanjiv Singh
You can set up all the properties (driver, partitionColumn, lowerBound, upperBound, numPartitions); start with the driver first. Once you have the maximum id, you can use it for the upperBound parameter. numPartitions can then be based on your table's dimensions and your
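A hedged sketch of the partitioned JDBC read being described (the URL, table and column names are placeholders):

    val df1 = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
      "dbtable"         -> "MY_SCHEMA.MY_TABLE",
      "driver"          -> "oracle.jdbc.OracleDriver",
      "partitionColumn" -> "ID",        // numeric column to split on
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",   // e.g. SELECT MAX(ID) run beforehand
      "numPartitions"   -> "8"          // 8 parallel reads / 8 partitions
    )).load()

    df1.rdd.getNumPartitions  // should now report 8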

Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi, If anyone is using or knows about a github repo that can help me get started with image and video processing using spark. The images/videos will be stored in s3 and I am planning to use s3 with Spark. In this case, how will spark achieve distributed processing? Any code base or references is

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
zookeeper.connect is irrelevant. Did you look at your executor logs? Did you look at the UI for the (probably failed) stages? Are you actually producing data into all of the kafka partitions? If you use kafka-simple-consumer-shell.sh to read that partition, do you get any data? On Wed, Aug 10,

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
Hi Cody, Just added zookeeper.connect to kafkaParams. It couldn't come out of the batch window; other batches are queued. I could see foreach(println) of the dataFrame printing one partition's data and not the other's. Couldn't see any errors in the log. val brokers = "localhost:9092,localhost:9093"

UNSUBSCRIBE

2016-08-10 Thread Sudhanshu Janghel

Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 5:14 PM abhishek singh wrote: > >

Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 8:03 PM James Ding wrote: > >

Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Wed, Aug 10, 2016 at 2:46 AM Martin Somers wrote: > > > -- > M >

Re: Unsubscribe

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 3:02 PM Hogancamp, Aaron < aaron.t.hoganc...@leidos.com> wrote: > Unsubscribe. > > > > Thanks, > > > > Aaron Hogancamp > > Data Scientist > > >

Re: Unsubscribe.

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 3:05 PM Martin Somers wrote: > Unsubscribe. > > Thanks > M >

suggestion needed on FileInput Path- Spark Streaming

2016-08-10 Thread mdkhajaasmath
What is the best practice for processing files from an s3 bucket in spark file streaming? I keep getting files in the s3 path and have to process them in batches, but while processing, other files might come in. In this streaming job, should I move files after the end of our streaming batch

Re: Machine learning question (using spark)- removing redundant factors while doing clustering

2016-08-10 Thread Sean Owen
Scaling can mean scaling factors up or down so that they're all on a comparable scale. It certainly changes the sum of squared errors, but, you can't compare this metric across scaled and unscaled data, exactly because one is on a totally different scale and will have quite different absolute

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
Those logs you're posting are from right after your failure, they don't include what actually went wrong when attempting to read json. Look at your logs more carefully. On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi" wrote: > Hi Siva, > > With below code, it is stuck

Re: Machine learning question (using spark)- removing redundant factors while doing clustering

2016-08-10 Thread Rohit Chaddha
Hi Sean, So basically I am trying to cluster a number of elements (it's a domain object called PItem) based on the quality factors of these items. These elements have 112 quality factors each. Now the issue is that when I am scaling the factors using StandardScaler I get a Sum of Squared Errors

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Sivakumaran S
I am testing with one partition now. I am using Kafka 0.9 and Spark 1.6.1 (Scala 2.11). Just start with one topic first and then add more. I am not partitioning the topic. HTH, Regards, Sivakumaran > On 10-Aug-2016, at 5:56 AM, Diwakar Dhanuskodi > wrote: >

Spark SQL Parallelism - While reading from Oracle

2016-08-10 Thread Siva A
Hi Team, How do we increase the parallelism in Spark SQL? In Spark Core, we can re-partition or pass extra arguments as part of the transformation. I am trying the below example: val df1 = sqlContext.read.format("jdbc").options(Map(...)).load val df2 = df1.cache df2.count Here the count operation

Running spark Java on yarn cluster

2016-08-10 Thread atulp
Hi Team, I am new to spark and writing my first program. I have written a sample program with the spark master as local. To execute spark over local yarn, what should be the value of the spark.master property? Can I point to a remote yarn cluster? I would like to execute this as a java application and not


RE: Spark join and large temp files

2016-08-10 Thread Ashic Mahtab
Already tried that. The CPU hits 100% on the collectAsMap (even tried foreaching to a java ConcurrentHashmap), and eventually finishes, but while broadcasting, it takes a while, and at some point there's some timeout, and the worker is killed. The driver (and workers) have more than enough RAM

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
A good way is to implement your own data source to load data in matrix format. You can refer to the LibSVM data format ( https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml/source/libsvm ), which contains one column of vector type, which is very similar to a matrix.

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-08-10 Thread Mich Talebzadeh
Hi, Have you raised a Jira for this? Thanks Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir, Did you use VectorAssembler to assemble some columns into the feature column? If there are NULLs in your dataset, VectorAssembler will throw this exception. You can use DataFrame.na.drop() or DataFrame.na.replace() to drop/substitute NULL values. Thanks Yanbo 2016-08-07 19:51 GMT-07:00
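A small sketch of the suggested cleanup before running VectorAssembler (the column names are assumptions): either drop rows with NULL features or fill a default value.

    val cleaned = df.na.drop(Seq("f1", "f2"))           // drop rows where f1 or f2 is NULL
    // or keep the rows and substitute a default instead:
    val filled  = df.na.fill(0.0, Seq("f1", "f2"))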

Please help: Spark job hung/stop writing after exceeding the folder size

2016-08-10 Thread Bhupendra Mishra
Dear All, I have been struggling with an issue where a Spark streaming job gets hung after exceeding the size of the output folder path. Here are more details: I have Flume sending and configuration agent1.sources = source1 agent1.sinks = sink1 agent1.channels = channel2 # Describe/configure source1

Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame which will be fed into the estimator, such as LogisticRegression. The output array will be the encoded feature names corresponding to the coefficients of the model. Thanks Yanbo 2016-08-08 15:53 GMT-07:00 Cesar : > > I have a data
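A hedged sketch of one way to recover those encoded feature names, assuming the vector column fed to LogisticRegression is named "features" and was produced by VectorAssembler or the encoders (which attach ML attribute metadata to the column):

    import org.apache.spark.ml.attribute.AttributeGroup

    val group = AttributeGroup.fromStructField(trainingDf.schema("features"))
    group.attributes.foreach { attrs =>
      attrs.foreach { a =>
        println(s"${a.index.getOrElse(-1)} -> ${a.name.getOrElse("(unnamed)")}")
      }
    }
    // the printed index matches the position of each coefficient in lrModel.coefficients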

Re: Spark Thrift Server (Spark 2.0) show table has value with NULL in all fields

2016-08-10 Thread Chanh Le
Hi Gene, It's a Spark 2.0 issue. I switched to Spark 1.6.1 and it's ok now. Thanks. On Thursday, July 28, 2016 at 4:25:48 PM UTC+7, Chanh Le wrote: > > Hi everyone, > > I have a problem when I create an external table in Spark Thrift Server (STS) > and query the data. > > Scenario: > *Spark 2.0* >

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
Hi Siva, With the below code, it is stuck at sqlContext.read.json(rdd.map(_._2)).toDF(). There are two partitions in the topic. I am running spark 1.6.2 val topics = "topic.name" val brokers = "localhost:9092" val topicsSet = topics.split(",").toSet val sparkConf = new
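A hedged reconstruction of the kind of job under discussion (Spark 1.6.x with the direct Kafka stream; the broker list and topic name are placeholders, and `sc`/`sqlContext` are assumed to already exist):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val topicsSet   = "topic.name".split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092,localhost:9093")
    val ssc         = new StreamingContext(sc, Seconds(10))

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {                              // skip empty micro-batches
        val df = sqlContext.read.json(rdd.map(_._2))     // values are JSON strings
        df.show()
      }
    }

    ssc.start()
    ssc.awaitTermination()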

Re: Change nullable property in Dataset schema

2016-08-10 Thread Kazuaki Ishizaki
After some investigation, I was able to change the nullable property in Dataset[Array[Int]] in the following way. Is this the right way? (1) Apply https://github.com/apache/spark/pull/13873 (2) Use two Encoders. One is RowEncoder. The other is a predefined ExpressionEncoder. class Test extends QueryTest

UNSUBSCRIBE

2016-08-10 Thread Martin Somers
-- M