Re: Spark2 SBT Assembly

2016-08-11 Thread Jacek Laskowski
Hi, What about...libraryDependencies in build.sbt with % Provided + sbt-assembly + sbt assembly = DONE. Not much has changed since. Jacek On 11 Aug 2016 11:29 a.m., "Efe Selcuk" wrote: > Bump! > > On Wed, Aug 10, 2016 at 2:59 PM, Efe Selcuk wrote: >
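
A minimal build.sbt sketch of that setup (project name and versions are illustrative, not from the thread); sbt-assembly itself goes in project/plugins.sbt:

    // build.sbt -- mark Spark as "provided" so it stays out of the fat jar
    name := "my-spark-app"            // illustrative project name
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
    )

    // project/plugins.sbt:
    // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // then build the fat jar with:  sbt assembly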

Log messages for shuffle phase

2016-08-11 Thread Suman Somasundar
Hi, While going through the logs of an application, I noticed that I could not find any logs to dig deeper into any of the shuffle phases. I am interested in finding out the time taken by each shuffle phase, the size of data spilled to disk if any, among other things. Does anyone know

Re: DataFramesWriter saving DataFrames timestamp in weird format

2016-08-11 Thread Hyukjin Kwon
Do you mind if I ask which format you used to save the data? I guess you used CSV and there is a related PR open here https://github.com/apache/spark/pull/14279#issuecomment-237434591 2016-08-12 6:04 GMT+09:00 Jestin Ma : > When I load in a timestamp column and try
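
Until a fix lands, one workaround is to render the timestamp as a formatted string before writing; a minimal sketch, assuming the column is called ts and the output path is made up:

    import org.apache.spark.sql.functions.{col, date_format}

    // render the timestamp as a readable string before writing to CSV
    val out = df.withColumn("ts", date_format(col("ts"), "yyyy-MM-dd HH:mm:ss"))
    out.write.option("header", "true").csv("/tmp/out")   // hypothetical output path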

Losing executors due to memory problems

2016-08-11 Thread Muttineni, Vinay
Hello, I have a spark job that basically reads data from two tables into two Dataframes, which are subsequently converted to RDDs. I then join them based on a common key. Each table is about 10 TB in size, but after filtering, the two RDDs are about 500GB each. I have 800 executors with 8GB

RE: Spark join and large temp files

2016-08-11 Thread Ashic Mahtab
Hi Ben, already tried that. The thing is that any form of shuffle on the big dataset (which repartition will cause) puts a node's chunk into /tmp, and that fills up the disk. I solved the problem by storing the 1.5GB dataset in an embedded redis instance on the driver, and doing a straight flatmap of
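
A rough sketch of that lookup-join pattern, assuming the Jedis client, an RDD of (key, value) string pairs, and made-up host details; this is not the exact code from the thread:

    import redis.clients.jedis.Jedis

    // bigRdd is assumed to be an RDD[(String, String)] of the large dataset;
    // each partition opens its own connection to the Redis instance on the driver host
    val joined = bigRdd.mapPartitions { iter =>
      val jedis = new Jedis("driver-host", 6379)   // hypothetical host and port
      iter.flatMap { case (key, value) =>
        // jedis.get returns null when the key is absent from the small dataset
        Option(jedis.get(key)).map(small => (key, (value, small)))
      }
      // a production version would also close the connection once the iterator is exhausted
    }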

DataFramesWriter saving DataFrames timestamp in weird format

2016-08-11 Thread Jestin Ma
When I load in a timestamp column and try to save it immediately without any transformations, the output time is unix time with padded 0's until there are 16 values. For example, loading in a time of August 3, 2016, 00:36:25 GMT, which is 1470184585 in UNIX time, saves as 147018458500. When

Re: Spark 2 cannot create ORC table when CLUSTERED. This worked in Spark 1.6.1

2016-08-11 Thread Gourav Sengupta
Spark also reads ORC data very slowly, and in case the Hive table is partitioned, it just hangs. Regards, Gourav On Thu, Aug 11, 2016 at 6:02 PM, Mich Talebzadeh wrote: > > > This does not work with CLUSTERED BY clause in Spark 2 now! > > CREATE TABLE

Re: Spark join and large temp files

2016-08-11 Thread Ben Teeuwen
Hmm, hashing will probably send all of the records with the same key to the same partition / machine. I'd try it out, and hope that if you have a few super-large keys bigger than the RAM available on one node, they spill to disk. Maybe play with persist() and using a different Storage Level. >

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
Ok, interesting. Would be interested to see how it compares. By the way, the feature size you select for the hasher should be a power of 2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are evenly distributed (see the section on HashingTF under
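
For reference, a minimal spark.ml sketch with a power-of-two feature dimension (column names are assumptions):

    import org.apache.spark.ml.feature.HashingTF

    // 2^24 buckets; a power of two keeps the hashed feature indexes evenly distributed
    val hasher = new HashingTF()
      .setInputCol("tokens")      // hypothetical column of token sequences
      .setOutputCol("features")
      .setNumFeatures(1 << 24)

    val hashed = hasher.transform(df)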

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Ben Teeuwen
Thanks Nick, I played around with the hashing trick. When I set numFeatures to the number of distinct values for the largest sparse feature, I ended up with half of them colliding. When raising numFeatures to have fewer collisions, I soon ended up with the same memory problems as before. To

Re: Single point of failure with Driver host crashing

2016-08-11 Thread Mich Talebzadeh
Thanks Ted, In this case we were using Standalone with Standalone master started on another node. The app was started on a node but not the master node. The master node was not affected. The node in question was the edge (running spark-submit). From the link I was not sure this matter would

Re: Single point of failure with Driver host crashing

2016-08-11 Thread Ted Yu
Have you read https://spark.apache.org/docs/latest/spark-standalone.html#high-availability ? FYI On Thu, Aug 11, 2016 at 12:40 PM, Mich Talebzadeh wrote: > > Hi, > > Although Spark is fault tolerant when nodes go down like below: > > FROM tmp > [Stage 1:===>

Re: Spark join and large temp files

2016-08-11 Thread Ben Teeuwen
When you read both ‘a’ and ‘b', can you try repartitioning both by column ‘id’? If you .cache() and .count() to force a shuffle, it'll push the records that will be joined to the same executors. So; a = spark.read.parquet(‘path_to_table_a’).repartition(‘id’).cache() a.count() b =
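
The same idea written out as a minimal Scala sketch; the paths and the join key "id" are assumptions:

    import org.apache.spark.sql.functions.col

    // repartition both sides on the join key so matching records land on the same executors
    val a = spark.read.parquet("path_to_table_a").repartition(col("id")).cache()
    a.count()   // force the shuffle and caching

    val b = spark.read.parquet("path_to_table_b").repartition(col("id")).cache()
    b.count()

    val joined = a.join(b, "id")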

Single point of failure with Driver host crashing

2016-08-11 Thread Mich Talebzadeh
Hi, Although Spark is fault tolerant when nodes go down like below: FROM tmp [Stage 1:===> (20 + 10) / 100]16/08/11 20:21:34 ERROR TaskSchedulerImpl: Lost executor 3 on xx.xxx.197.216: worker lost [Stage 1:>

Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
The algorithm update is just broken into 2 steps: trainOn - to learn/update the cluster centers, and predictOn - to predict cluster assignment on data. The StreamingKMeansExample you reference breaks up data into training and test because you might want to score the predictions. If you don't care
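
A minimal sketch of that two-step API; K, the vector dimension, and the input DStreams of Vectors are all assumptions:

    import org.apache.spark.mllib.clustering.StreamingKMeans

    // trainingStream and testStream are assumed to be DStream[Vector]s
    val model = new StreamingKMeans()
      .setK(3)                     // illustrative number of clusters
      .setDecayFactor(1.0)
      .setRandomCenters(10, 0.0)   // 10-dimensional random initial centers

    model.trainOn(trainingStream)          // learn/update the cluster centers
    model.predictOn(testStream).print()    // assign clusters to incoming data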

Re: Spark streaming not processing messages from partitioned topics

2016-08-11 Thread Diwakar Dhanuskodi
Figured it out. All I was doing wrong was testing it in a pseudo-node VM with 1 core. The tasks were starved for CPU. In a production cluster this works just fine. On Thu, Aug 11, 2016 at 12:45 AM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Checked executor logs and UI .

Re: Spark2 SBT Assembly

2016-08-11 Thread Efe Selcuk
Bump! On Wed, Aug 10, 2016 at 2:59 PM, Efe Selcuk wrote: > Thanks for the replies, folks. > > My specific use case is maybe unusual. I'm working in the context of the > build environment in my company. Spark was being used in such a way that > the fat assembly jar that the

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
Can someone also provide input on why my code may not be working? Below, I have pasted part of my previous reply, which describes the issue I am having here. I am really more perplexed about the first set of code (in bold). I know why the second set of code doesn't work; it is just something I

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
I should be more clear, since the outcome of the discussion above was not that obvious actually. - I agree a change should be made to StandardScaler, and not VectorAssembler - However I do think withMean should still be false by default and be explicitly enabled - The 'offset' idea is orthogonal,
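
For context, explicitly requesting centering in spark.ml looks like the sketch below (column names are assumptions); under the change discussed, this explicit request is the only case that would densify sparse input:

    import org.apache.spark.ml.feature.StandardScaler

    // centering (withMean) must be requested explicitly; it densifies sparse vectors
    val scaler = new StandardScaler()
      .setInputCol("features")          // hypothetical vector column
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)

    val scaled = scaler.fit(df).transform(df)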

Spark 2 cannot create ORC table when CLUSTERED. This worked in Spark 1.6.1

2016-08-11 Thread Mich Talebzadeh
This does not work with CLUSTERED BY clause in Spark 2 now! CREATE TABLE test.dummy2 ( ID INT , CLUSTERED INT , SCATTERED INT , RANDOMISED INT , RANDOM_STRING VARCHAR(50) , SMALL_VC VARCHAR(10) , PADDING VARCHAR(10) ) CLUSTERED BY (ID) INTO 256 BUCKETS STORED AS ORC

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-11 Thread cdecleene
The data is uncorrupted, as I can create the dataframe from the underlying raw parquet from spark 2.0.0 if, instead of using SparkSession.sql() to create a dataframe, I use SparkSession.read.parquet().

Re: Table registered using registerTempTable not found in HiveContext

2016-08-11 Thread Mich Talebzadeh
This is Spark 2. You create a temp table from the df using HiveContext: val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> s.registerTempTable("tmp") scala> HiveContext.sql("select count(1) from tmp") res18: org.apache.spark.sql.DataFrame = [count(1): bigint] scala>
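
Note that in Spark 2, registerTempTable is deprecated in favor of createOrReplaceTempView, and the SparkSession can query the view directly; a minimal sketch:

    // Spark 2 style: the SparkSession queries the view directly
    s.createOrReplaceTempView("tmp")   // s is the DataFrame from the snippet above
    spark.sql("select count(1) from tmp").show()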

Spark 1.6.2 HiveServer2 cannot access temp tables

2016-08-11 Thread Richard M
I'm attempting to access a dataframe from JDBC: however, this temp table is not accessible from beeline when connected to this instance of HiveServer2.

Re: Table registered using registerTempTable not found in HiveContext

2016-08-11 Thread Richard M
How are you calling registerTempTable from hiveContext? It appears to be a private method.

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-11 Thread Richard M
I am running HiveServer2 as well and when I connect with beeline I get the following: org.apache.spark.sql.internal.SessionState cannot be cast to org.apache.spark.sql.hive.HiveSessionState Do you know how to resolve this?

Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Ahmed Sadek
Dear All, I was wondering why there is training data and testing data in kmeans? Shouldn't it be unsupervised learning with just access to stream data? I found a similar question but couldn't understand the answer.

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
Opening this follow-up question to the entire mailing list. Anyone have thoughts on how I can add a column of dense vectors (created by converting a column of sparse features) to a data frame? My efforts are below. Although I know this is not the best approach for something I plan to put in
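
One commonly suggested approach is a small UDF that converts each sparse vector to a dense one and adds it with withColumn; a sketch assuming Spark 2 ml.linalg vectors and a column named "features":

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // convert each (possibly sparse) vector to its dense form; typed as Vector
    // so Spark can pick up the registered vector UDT for the new column
    val toDense = udf((v: Vector) => (v.toDense: Vector))

    val withDense = df.withColumn("denseFeatures", toDense(col("features")))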

Re: update specifc rows to DB using sqlContext

2016-08-11 Thread Peyman Mohajerian
Alternatively, you should be able to write to a new table and use a trigger or some other mechanism to update the particular row. I don't have any experience with this myself; I am just looking at this documentation:

dataframe row list question

2016-08-11 Thread vr spark
I have data which is JSON in this format:

myList: array
 |    |    |-- elem: struct
 |    |    |    |-- nm: string (nullable = true)
 |    |    |    |-- vList: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)

From my Kafka stream, I created a dataframe
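
To get at rows of that nested structure, explode is a common approach; a minimal sketch, assuming the dataframe is called df:

    import org.apache.spark.sql.functions.{col, explode}

    // one output row per element of myList, then pull the struct fields out
    val flattened = df
      .select(explode(col("myList")).as("elem"))
      .select(col("elem.nm"), col("elem.vList"))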

Re: na.fill doesn't work

2016-08-11 Thread Javier Rey
Thanks Aseem, I'll check this. Samir On Aug 11, 2016 4:39 AM, "Aseem Bansal" wrote: > Check the schema of the data frame. It may be that your columns are > String. You are trying to give default for numerical data. > > On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey

Re: Spark excludes "fastutil" dependencies we need

2016-08-11 Thread cryptoe
I ran into a similar problem while using QDigest. Shading the clearspring and fastutil classes solves the problem. Snippet from pom.xml: package shade

Spark 2.0.0 - Java API - Modify a column in a dataframe

2016-08-11 Thread Aseem Bansal
Hi, I have a Dataset. I will change a String to a String, so there will be no schema changes. Is there a way I can run a map on it? I have seen the function at
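
One common way to rewrite a String column in place without touching the schema is withColumn plus a UDF; a Scala sketch (the Java API is analogous), with the column name and transformation made up:

    import org.apache.spark.sql.functions.{col, udf}

    // rewrite the hypothetical "name" column in place; the schema keeps the same
    // column name and String type
    val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)
    val updated = ds.withColumn("name", normalize(col("name")))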

Re: update specifc rows to DB using sqlContext

2016-08-11 Thread Mich Talebzadeh
In that case one alternative would be to save the new table on HDFS and then, using some simple ETL, load it into a staging table in MySQL and update the original table from the staging table. The whole thing can be done in a shell script. HTH Dr Mich Talebzadeh
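
If the staging load is driven from Spark rather than a shell script, a minimal sketch of writing the result to a MySQL staging table over JDBC; the URL, table name, and credentials are assumptions:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    // hypothetical connection details for the MySQL staging table
    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")
    props.setProperty("driver", "com.mysql.jdbc.Driver")

    // overwrite the staging table; the original table is then updated with ordinary SQL
    df.write.mode(SaveMode.Overwrite)
      .jdbc("jdbc:mysql://dbhost:3306/mydb", "staging_table", props)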

Re: update specifc rows to DB using sqlContext

2016-08-11 Thread sujeet jog
I read the table via Spark SQL and perform some ML activity on the data, and the result will be to update some specific columns with the improved ML output, hence I do not have an option to do the whole operation in MySQL. Thanks, Sujeet On Thu, Aug 11, 2016 at 3:29 PM, Mich Talebzadeh

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
No, that doesn't describe the change being discussed, since you've copied the discussion about adding an 'offset'. That's orthogonal. You're also suggesting making withMean=True the default, which we don't want. The point is that if this is *explicitly* requested, the scaler shouldn't refuse to

Re: update specifc rows to DB using sqlContext

2016-08-11 Thread Mich Talebzadeh
OK, it is clearer now. You are using Spark as the query tool on an RDBMS table: read the table via JDBC, write back updating certain records. I have not done this myself, but I suspect the issue would be whether a Spark write will commit the transaction and maintain ACID compliance (locking the rows etc).

Re: update specifc rows to DB using sqlContext

2016-08-11 Thread sujeet jog
1) Using a MySQL DB. 2) Will be inserting/updating/overwriting the same table. 3) I want to update a specific column in a record. The data is read via Spark SQL; on the below table, which is read via Spark SQL, I would like to update the NumOfSamples column. Consider DF as the dataFrame which holds

Re: na.fill doesn't work

2016-08-11 Thread Aseem Bansal
Check the schema of the data frame. It may be that your columns are String. You are trying to give a default for numerical data. On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote: > Hi everybody, > > I have a data frame after many transformations, my final task is to fill na's >
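
A minimal sketch of matching the fill defaults to the column types; the column names are hypothetical:

    // numeric columns get a numeric default, string columns a string default
    val filled = df.na.fill(Map(
      "age"  -> 0,          // hypothetical numeric column
      "name" -> "unknown"   // hypothetical string column
    ))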

Re: Can't generate model for prediction

2016-08-11 Thread Zakaria Hili
Here you can find more information about the code of my class "RandomForestRegression..java": http://spark.apache.org/docs/latest/mllib-ensembles.html#regression 2016-08-11 10:18 GMT+02:00 Zakaria Hili : > Hi, > > I recognize that spark can't save generated model on HDFS

Can't generate model for prediction

2016-08-11 Thread Zakaria Hili
Hi, I recognize that spark can't save generated model on HDFS (I used random forest regression and linear regression for this test). It can save only the data directory, as you can see in the picture below: [image: Inline image 1] but to load a model I will need some data from metadata
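
For reference, the mllib save/load calls in question look like the sketch below (paths are illustrative); save is expected to write both a data/ and a metadata/ directory, and load needs both:

    import org.apache.spark.mllib.tree.model.RandomForestModel

    // persist the trained model; save writes both a data/ and a metadata/ directory
    model.save(sc, "hdfs:///models/rf")   // hypothetical path

    // loading later expects both directories to be present
    val restored = RandomForestModel.load(sc, "hdfs:///models/rf")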

Re: spark 2.0 readStream from a REST API

2016-08-11 Thread Sela, Amit
The currently available output modes are Complete and Append. Complete mode is for stateful processing (aggregations), and Append mode is for stateless processing (i.e., map/filter). See: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
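
For reference, the output mode is set on the stream writer; a minimal Spark 2.0 sketch, assuming streamingDf is a streaming DataFrame and the sink is just for illustration:

    // stateless query (filter only) -> Append mode; an aggregation would use Complete
    val query = streamingDf
      .filter("value IS NOT NULL")
      .writeStream
      .outputMode("append")
      .format("console")   // hypothetical sink, just for illustration
      .start()

    query.awaitTermination()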

Re: Random forest binary classification H20 difference Spark

2016-08-11 Thread Bedrytski Aliaksandr
Hi Samir, either use the *dataframe.na.fill()* method or the *nvl()* UDF when selecting features: val train = sqlContext.sql("SELECT ... nvl(Field, 1.0) AS Field ... FROM test") -- Bedrytski Aliaksandr sp...@bedryt.ski On Wed, Aug 10, 2016, at 11:19, Yanbo Liang wrote: > Hi Samir, > > Did