Re: Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Yash Sharma
This answers exactly what you are looking for - http://stackoverflow.com/a/34204640/1562474 On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez wrote: > Is it possible with Spark SQL to merge columns whose types are Arrays or > Sets? > > My use case would be something

Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Yash Sharma
Spark is more of an execution engine than a database. Hive is a data warehouse, but I still like treating it as an execution engine. For databases, you could compare HBase and Cassandra, as they both have very wide usage and proven performance. We have used Cassandra in the past and were

Re: Spark cluster tuning recommendation

2016-07-11 Thread Yash Sharma
I would say use dynamic allocation rather than a fixed number of executors. Provide whatever executor memory you would like; deciding the values requires a couple of test runs and checking what works best for you. You could try something like --driver-memory 1G \ --executor-memory 2G \
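A fuller sketch of that kind of submit command, with dynamic allocation enabled (all values here are illustrative starting points, not recommendations, and `your-app.jar` is a placeholder; dynamic allocation also needs the external shuffle service):

```
spark-submit \
  --master yarn \
  --driver-memory 1G \
  --executor-memory 2G \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  your-app.jar
```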

Fwd: Fast database with writes per second and horizontal scaling

2016-07-11 Thread ayan guha
Hi, HBase is pretty neat itself. But speed is not the criterion for choosing HBase over Cassandra (or vice versa). Slowness can very well be because of design issues, and in that case changing technology will unfortunately not help :) I would suggest you quantify "slow"-ness in conjunction with

Re: Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Anyone in Spark as well? My colleague has been using Cassandra. However, he says it is too slow and not user friendly. MongoDB as a doc database is pretty neat but not fast enough. My main concern is fast writes per second and good scaling. Hive on Spark or Tez? How about HBase, or anything else

Re: Spark cluster tuning recommendation

2016-07-11 Thread Anuj Kumar
That configuration looks bad, with only two cores in use and 1GB used by the app. A few points: 1. Please oversubscribe those CPUs to at least twice the number of cores you have to start with, and then tune if it freezes. 2. Allocate all of the CPU cores and memory to your running app (I assume it is

Complications with saving Kafka offsets?

2016-07-11 Thread BradleyUM
I'm working on a Spark Streaming (1.6.0) project and one of our requirements is to persist Kafka offsets to Zookeeper after a batch has completed so that we can restart work from the correct position if we have to restart the process for any reason. Many links,
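A minimal sketch of the pattern usually recommended here (Spark 1.6 Kafka direct API; the broker, topic, and StreamingContext `ssc` are assumed from context, and Spark itself provides no offset-commit helper, so the ZooKeeper write is a hypothetical function you implement, e.g. with Curator):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events")) // topic name is a placeholder

stream.foreachRDD { rdd =>
  // Capture offset ranges on the driver, before any shuffle/repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd and write this batch's output ...
  // Only after the output succeeds, persist the offsets to ZooKeeper:
  // saveOffsetsToZookeeper(offsetRanges)  // hypothetical helper
}
```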

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread ayan guha
Hi Mich, Thanks for showing examples, makes perfect sense. One question: "...I agree that on VLT (very large tables), the limitation in available memory may be the overriding factor in using Spark"... have you observed any specific threshold for VLTs which tilts the balance against Spark? For

Re: Batch details are missing

2016-07-11 Thread C. Josephson
The solution ended up being upgrading from Spark 1.5 to Spark 1.6.1+ On Fri, Jun 24, 2016 at 2:57 PM, C. Josephson wrote: > We're trying to resolve some performance issues with spark streaming using > the application UI, but the batch details page doesn't seem to be working. >

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point with Hive on Spark and Hive on Tez + LLAP, thinking out loud :) 1. I am using Hive on Spark and I have a table of 10GB, say, with 100 users concurrently accessing the same partition of an ORC table (the last one hour or so). 2. Spark takes the data and puts it in memory. I gather

chisqSelector in Python

2016-07-11 Thread Tobi Bosede
Hi all, There is no Python example for chisqSelector at the link below: https://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#chisqselector So I am converting the Scala code to Python. I "translated" the following code: val discretizedData = data.map { lp =>
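For reference, an untested sketch of what that translation might look like with the pyspark.mllib API (the file path, bin width of 16, and numTopFeatures=50 follow the docs' Scala example; treat the details as assumptions):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

# Discretize each feature into 16-wide bins, mirroring the Scala snippet
discretized = data.map(lambda lp: LabeledPoint(
    lp.label, Vectors.dense([float(int(x / 16)) for x in lp.features.toArray()])))

selector = ChiSqSelector(numTopFeatures=50)
model = selector.fit(discretized)
filtered = discretized.map(lambda lp: LabeledPoint(
    lp.label, model.transform(lp.features)))
```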

Re: Error starting thrift server on Spark

2016-07-11 Thread Jacek Laskowski
Create the directory and start over. You've got history server enabled. Jacek On 11 Jul 2016 11:07 p.m., "Marco Colombo" wrote: Hi all, I cannot start thrift server on spark 1.6.2 I've configured binding port and IP and left default metastore. In logs I get:

QuantileDiscretizer not working properly with big dataframes

2016-07-11 Thread Pasquinell Urbani
Hi all, We have a dataframe with 2.5 million records and 13 features. We want to perform a logistic regression with this data, but first we need to divide each column into discrete values using QuantileDiscretizer. This will improve the performance of the model by avoiding outliers. For small
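For context, a minimal sketch of the API in question (Spark 1.6 ml; column names and bucket count are placeholders). Note that on 1.6 the discretizer computes its splits from a sample of the column, which may be one reason results look off on large dataframes:

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("feature_0")     // placeholder column name
  .setOutputCol("feature_0_bin")
  .setNumBuckets(10)

val binned = discretizer.fit(df).transform(df)
```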

Re: Spark hangs at "Removed broadcast_*"

2016-07-11 Thread dhruve ashar
Hi, Can you check from the logs when the job actually finished? The logs provided are too short and do not reveal meaningful information. On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime wrote: > Spark 2.0.0-preview > > We've got an app that uses a fairly big

Spark cluster tuning recommendation

2016-07-11 Thread Kartik Mathur
I am trying to run terasort in Spark, for a 7 node cluster with only 10g of data, and executors get lost with GC overhead limit exceeded errors. This is what my cluster looks like:
- Alive Workers: 7
- Cores in use: 28 Total, 2 Used
- Memory in use: 56.0 GB Total, 1024.0 MB Used

/spark-ec2 script: trouble using ganglia web ui spark 1.6.1

2016-07-11 Thread Andy Davidson
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script. The output shows ganglia started, however I am not able to access http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:5080/ganglia. I have tried using the private IP from within my data center. I do not see anything listening

Re: Spark Streaming - Direct Approach

2016-07-11 Thread Tathagata Das
Aah, the docs have not been updated. It is totally in production in many places. Others should chime in as well. On Mon, Jul 11, 2016 at 1:43 PM, Mail.com wrote: > Hi All, > > Can someone please confirm if streaming direct approach for reading Kafka > is still

Re: Spark Streaming - Direct Approach

2016-07-11 Thread Andy Davidson
Hi Pradeep, I cannot comment on experimental vs. production, however I recently started a POC using the direct approach. It's been running off and on for about 2 weeks. In general it seems to work really well. One thing that is not clear to me is how the cursor is managed. E.g. I have my topic set

Error starting thrift server on Spark

2016-07-11 Thread Marco Colombo
Hi all, I cannot start the thrift server on spark 1.6.2. I've configured binding port and IP and left the default metastore. In logs I get:
16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717
16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager
16/07/11 22:51:46

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
In my test I did like for like, keeping the methodology the same, namely:
1. The table was a parquet table of 100 million rows
2. The same setup was used for both Hive on Spark and Hive on MR
3. Spark was very impressive compared to MR on this particular test.
Just to see any issues I

Spark Streaming - Direct Approach

2016-07-11 Thread Mail.com
Hi All, Can someone please confirm if streaming direct approach for reading Kafka is still experimental or can it be used for production use. I see the documentation and talk from TD suggesting the advantages of the approach but docs state it is an "experimental" feature. Please suggest

Processing ion formatted messages in spark

2016-07-11 Thread pandees waran
All, did anyone ever work on processing Ion-formatted messages in Spark? The Ion format is a superset of JSON: all JSON documents are valid Ion, but the reverse is not true. For more details on Ion: http://amznlabs.github.io/ion-docs/ Thanks.

Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Pedro Rodriguez
Is it possible with Spark SQL to merge columns whose types are Arrays or Sets? My use case would be something like this. DF types:
id: String
words: Array[String]
I would want to do something like df.groupBy('id).agg(merge_arrays('words)) -> list of all words
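One common approach (matching the Stack Overflow answer cited in the reply above) is to explode the array column and re-aggregate. A minimal sketch, assuming a HiveContext on Spark 1.x for the collect_* aggregates:

```scala
import org.apache.spark.sql.functions.{col, collect_set, explode}

val merged = df
  .select(col("id"), explode(col("words")).as("word"))
  .groupBy("id")
  .agg(collect_set("word").as("words")) // use collect_list to keep duplicates
```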

Re: question about UDAF

2016-07-11 Thread Pedro Rodriguez
I am not sure I understand your code entirely, but I worked with UDAFs Friday and over the weekend ( https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a). I think what is going on is that your "update" function is not defined correctly. Update should take a possibly initialized or

trouble accessing driver log files using rest-api

2016-07-11 Thread Andy Davidson
I am running spark-1.6.1 and the standalone cluster manager. I am running into performance problems with spark streaming and added some extra metrics to my log files. I submit my app in cluster mode (i.e. the driver runs on a slave, not the master). I am not able to get the driver log files while

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Thanks Michael! But what about when I am not trying to save as parquet? No way around the error using saveAsTable()? I am using Spark 1.4. Tobi On Jul 11, 2016 2:10 PM, "Michael Armbrust" wrote: > This is protecting you from a limitation in parquet. The library will

Re: Custom Spark Error on Hadoop Cluster

2016-07-11 Thread Xiangrui Meng
(+user@spark. Please copy user@ so other people can see and help.) The error message means you have an MLlib jar on the classpath, but it didn't contain ALS$StandardNNLSSolver. So either the modified jar was not deployed to the workers, or there is an unmodified MLlib jar sitting in front

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Michael Armbrust
This is protecting you from a limitation in parquet. The library will let you write out invalid files that can't be read back, so we added this check. You can call .format("csv") (in spark 2.0) to switch it to CSV. On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede wrote: > Hi
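A minimal sketch of that suggestion (Spark 2.0's built-in csv source; the output path is a placeholder):

```scala
df.write.format("csv").option("header", "true").save("/tmp/out")
```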

Run Stored Procedures from Spark SqlContext

2016-07-11 Thread zachkirsch
Hi, I have a SQL Server set up, and I also have a Spark cluster up and running that is executing Scala programs. I can connect to the SQL Server and query for data successfully. However, I need to call stored procedures from the Scala/Spark code (stored procedures that exist in the database) and
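Spark's JDBC data source only reads tables or queries, so the usual answer is to call the procedure over plain JDBC (here on the driver). A hedged sketch — the URL, credentials, and procedure name are placeholders:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:sqlserver://host:1433;databaseName=mydb", "user", "pass")
try {
  val stmt = conn.prepareCall("{call dbo.my_procedure(?)}")
  stmt.setInt(1, 42)
  stmt.execute() // for procedures returning rows, read stmt.getResultSet
} finally {
  conn.close()
}
```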

Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Hi everyone, I am trying to save a data frame with special characters in the column names as a table in hive. However I am getting the following error. Is the only solution to rename all the columns? Or is there some argument that can be passed into saveAsTable() or write.parquet()
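A hedged workaround sketch: parquet rejects column names containing any of " ,;{}()\n\t=", so one option is to alias every column before saving (the replacement scheme and table name are illustrative):

```scala
// Replace parquet-illegal characters in every column name, then save
val cleaned = df.select(df.columns.map(c =>
  df(c).as(c.replaceAll("[ ,;{}()\\n\\t=]+", "_"))): _*)

cleaned.write.saveAsTable("my_table_cleaned")
```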

Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
That was my bad with the title. I am getting that output when I run my application, both from the IDE as well as in the console. I want the server logs themselves displayed in the terminal from where I start the server. Right now, running the command ‘start-master.sh’ returns the prompt. I want

Re: Question on Spark shell

2016-07-11 Thread Anthony May
I see. The title of your original email was "Spark Shell", which is a Spark REPL environment based on the Scala shell, which is why I misunderstood you. You should see the same output when starting the application on the console. Are you not seeing any output? On Mon, 11 Jul 2016 at 11:55 Sivakumaran S

Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
I am running a spark streaming application using Scala in the IntelliJ IDE. I can see the Spark output in the IDE itself (aggregation and stuff). I want the spark server logging (INFO, WARN, etc.) to be displayed on screen when I start the master in the console. For example, when I start a kafka

Re: Question on Spark shell

2016-07-11 Thread Anthony May
Starting the Spark Shell gives you a Spark Context to play with straight away. The output is printed to the console. On Mon, 11 Jul 2016 at 11:47 Sivakumaran S wrote: > Hello, > > Is there a way to start the spark server with the log output piped to > screen? I am currently

Question on Spark shell

2016-07-11 Thread Sivakumaran S
Hello, Is there a way to start the spark server with the log output piped to the screen? I am currently running spark in standalone mode on a single machine. Regards, Sivakumaran
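Two hedged options, assuming the standalone scripts' default log layout: tail the daemon's log file, or run the master class in the foreground via spark-class (which is what start-master.sh wraps):

```
tail -f $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out
# or run the master in the foreground so its log stays on screen:
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master
```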

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Appreciate all the comments. Hive on Spark: Spark runs as an execution engine and is only used when you query Hive; otherwise it is not running. I run it in Yarn client mode. Let me show you an example. In hive-site.xml, set the execution engine to spark. It requires some configuration
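The property being described is hive.execution.engine; the session-level equivalent of the hive-site.xml change is:

```sql
set hive.execution.engine=spark;
```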

Using accumulators in Local mode for testing

2016-07-11 Thread harelglik
Hi, I am writing an app in Spark (1.6.1) in which I am using an accumulator. My accumulator is simply counting rows: acc += 1. My test processes 4 files, each with 4 rows; however, the value of the accumulator at the end is not 16 and, even worse, it is inconsistent between runs. Are accumulators not
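A hedged sketch of the usual explanation: accumulator updates made inside transformations can be re-applied whenever an RDD is recomputed, so counts drift; Spark only guarantees each task's updates are applied once when they happen inside actions (the path is a placeholder):

```scala
val acc = sc.accumulator(0L, "rows")
val rows = sc.textFile("input/*") // placeholder path

rows.map { line => acc += 1; line }.count() // transformation: may over-count on re-evaluation
rows.foreach(_ => acc += 1)                 // action: applied once per successful task
```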

Re: KEYS file?

2016-07-11 Thread Sean Owen
Yeah the canonical place for a project's KEYS file for ASF projects is http://www.apache.org/dist/{project}/KEYS and so you can indeed find this key among: http://www.apache.org/dist/spark/KEYS I'll put a link to this info on the downloads page because it is important info. On Mon, Jul 11,

What is the maximum number of column being supported by apache spark dataframe

2016-07-11 Thread Zijing Guo
Hi all, Spark version: 1.5.2 with yarn 2.7.1.2.3.0.0-2557. I'm running into a problem while exploring the data through spark-shell: I'm trying to create a really fat dataframe with 3000 columns. Code as below: val valueFunctionUDF = udf((valMap: Map[String, String], dataItemId:

Marking files as read in Spark Streaming

2016-07-11 Thread soumick dasgupta
Hi, I am looking for a solution in Spark Streaming where I can mark the files that I have already read in HDFS. This is to make sure that I am not reading the same file by mistake and also to ensure that I have read all the records in a given file. Thank You, Soumick

Re: Cluster mode deployment from jar in S3

2016-07-11 Thread Steve Loughran
The fact you are using s3:// URLs means that you are using EMR and its S3 binding lib, which means you are probably going to have to talk to the AWS team there. Though I'm surprised to see a jets3t stack trace there, as the AWS s3: client uses the Amazon SDKs. S3n and s3a don't currently

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. Tez is ‘vendor’ independent. ;-) Yeah… I know… Anyone can support it. Only Hortonworks has stacked the deck in their favor. Drill could be in the same boat, although there are now more committers who are not working for MapR. I’m not sure who outside of HW is

WARN FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_201607111453_128606_m_000000_0 - s3n://

2016-07-11 Thread Andy Davidson
I am running into serious performance problems with my spark 1.6 streaming app. As it runs, it gets slower and slower. My app is simple:
* It receives fairly large and complex JSON files (twitter data)
* Converts the RDD to DataFrame
* Splits the data frame into maybe 20 different data sets
*

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
I think LLAP should in the future be a general component, so LLAP + Spark can make sense. I see Tez and Spark not as competitors; they have different purposes. Hive+Tez+LLAP is not the same as Hive+Spark; it goes beyond that for interactive queries. Tez - you should use a

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. If memory serves, Tez with LLAP is going to run as a separate engine that is constantly running, no? Spark? That runs under hive… Unless you’re suggesting that the spark context is constantly running as part of the hiveserver2? > On May

spark UI what does storage memory x/y mean

2016-07-11 Thread Andy Davidson
My stream app is running into problems. It seems to slow down over time. How can I interpret the storage memory column? I wonder if I have a GC problem. Any idea how I can get GC stats? Thanks, Andy
Executors (3)
* Memory: 9.4 GB Used (1533.4 MB Total)
* Disk: 0.0 B Used
Executor ID | Address | RDD
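GC time per task is already shown in the UI's task tables; for full GC logs, the tuning guide's approach is to pass standard HotSpot flags through the executor JVM options:

```
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```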

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Fridtjof Sander
Spark's implementation performs PAVA on each partition, then collects the results to the driver and performs PAVA again on the collected results. The hope is that enough data is pooled so that the last step does not exceed the driver's memory limits. This assumption

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
IsotonicRegression can handle a feature column of vector type. It will extract a certain index (controlled by the param "featureIndex") of this feature vector and feed it into model training. It performs the pool-adjacent-violators algorithm on each partition, so it's distributed and the data is
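For reference, a minimal sketch of the RDD-based mllib API being discussed (toy data; input rows are (label, feature, weight) triples):

```scala
import org.apache.spark.mllib.regression.IsotonicRegression

val input = sc.parallelize(Seq((1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.5, 3.0, 1.0)))
val model = new IsotonicRegression().setIsotonic(true).run(input)
model.predict(2.5) // interpolates between the fitted boundaries
```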

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. Otherwise, some thoughts of mine; feel free to comment and criticise :)
1. I am a member of the Spark, Hive and Tez user groups, plus one or two others
2. Spark is by far the biggest in terms of community interaction
3. Tez, typically one thread in

Spark hangs at "Removed broadcast_*"

2016-07-11 Thread velvetbaldmime
Spark 2.0.0-preview. We've got an app that uses a fairly big broadcast variable. We run this on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]]. At the end of saveAsTextFile, the output in the folder seems to be complete and

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich, Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark": have you made any more interesting findings that you would like to bring up? If Hive is offering both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why Tez + LLAP

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Fridtjof Sander
Hi Swaroop, From my understanding, isotonic regression is currently limited to data with one feature plus weight and label. Also, the entire data is required to fit into the memory of a single machine. I did some work on the latter issue but discontinued the project because I felt no one really

Spark job state is EXITED but does not return

2016-07-11 Thread Balachandar R.A.
Hello, I have a simple Apache Spark based use case that processes two datasets. Each dataset takes about 5-7 min to process. I am doing this processing inside the sc.parallelize(datasets){ } block. While the first dataset is processed successfully, the processing of the second dataset is not started by

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Chanh Le
Hi Mich, If I have a stored procedure in Oracle written like this (SP to get info):
PKG_ETL.GET_OBJECTS_INFO(
  p_LAST_UPDATED VARCHAR2,
  p_OBJECT_TYPE VARCHAR2,
  p_TABLE OUT SYS_REFCURSOR);
How do I call it in Spark, given that the output is the cursor p_TABLE OUT SYS_REFCURSOR? Thanks.
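A hedged sketch of one way to do this: Spark's JDBC source cannot invoke procedures, so call it over plain JDBC on the driver, register the OUT parameter as a cursor, and read it back as a ResultSet (connection details and argument values are placeholders):

```scala
import java.sql.{DriverManager, ResultSet}
import oracle.jdbc.OracleTypes

val conn = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/svc", "user", "pass")
val stmt = conn.prepareCall("{call PKG_ETL.GET_OBJECTS_INFO(?, ?, ?)}")
stmt.setString(1, "2016-07-11") // p_LAST_UPDATED
stmt.setString(2, "TABLE")      // p_OBJECT_TYPE
stmt.registerOutParameter(3, OracleTypes.CURSOR)
stmt.execute()
val rs = stmt.getObject(3).asInstanceOf[ResultSet]
// iterate rs, collect the rows, then e.g. sc.parallelize(rows) to continue in Spark
```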

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Thanks Mich, we have got it working using the example below ;) Mark > On 11 Jul 2016, at 09:45, Mich Talebzadeh wrote: > > Hi Mark, > > Hm. It should work. This is Spark 1.6.1 on Oracle 12c > > > scala> val HiveContext = new

question about UDAF

2016-07-11 Thread luohui20001
Hello guys, I have a DF and a UDAF. This DF has 2 columns, lp_location_id and id, both of Int type. I want to group by id and aggregate all values into 1 string. So I used a UDAF to do this transformation: multiple Int values to 1 String. However, my UDAF returns empty values as the
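For comparison, a minimal UDAF sketch for this exact shape (Spark 1.5+ API; concatenates Int values into one String, with illustrative names). The key contract: update folds each input row into the buffer, and merge combines two partial buffers:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatInts extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", IntegerType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("acc", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""
  // Fold one input row into the (possibly empty) buffer
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getString(0) + input.getInt(0) + ","
  // Combine two partially aggregated buffers
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getString(0) + buffer2.getString(0)
  def evaluate(buffer: Row): Any = buffer.getString(0)
}
// usage: sqlContext.udf.register("concat_ints", new ConcatInts)
//        df.groupBy("id").agg(expr("concat_ints(lp_location_id)"))
```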

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Chanh Le
Hi Tamas, I am using Spark 1.6.1. > On Jul 11, 2016, at 3:24 PM, Tamas Szuromi wrote: > > Hello, > > What spark version do you use? I have the same issue with Spark 1.6.1 and > there is a ticket somewhere. > > cheers, > > > > > Tamas Szuromi > Data Analyst >

Re: Zeppelin Spark with Dynamic Allocation

2016-07-11 Thread Tamas Szuromi
Hello, What spark version do you use? I have the same issue with Spark 1.6.1 and there is a ticket somewhere. cheers, Tamas Szuromi

Re: StreamingKmeans Spark doesn't work at all

2016-07-11 Thread Biplob Biswas
Hi Shuai, Thanks for the reply, I mentioned in the mail that I tried running the scala example as well from the link I provided and the result is the same. Thanks & Regards Biplob Biswas On Mon, Jul 11, 2016 at 5:52 AM, Shuai Lin wrote: > I would suggest you run the

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mich Talebzadeh
Hi Mark, Hm. It should work. This is Spark 1.6.1 on Oracle 12c:
scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@70f446c
scala> var _ORACLEserver : String =
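For completeness, a hedged sketch of the read pattern this session is building up to (Spark 1.6 JDBC source; connection details, table, and credentials are placeholders):

```scala
val df = HiveContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@//host:1521/svc",
  "dbtable"  -> "(SELECT id, name FROM scott.dummy) t",
  "user"     -> "scott",
  "password" -> "xxx",
  "driver"   -> "oracle.jdbc.OracleDriver"
)).load()
df.count()
```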

Re: Connection via JDBC to Oracle hangs after count call

2016-07-11 Thread Mark Vervuurt
Hi Mich, Sorry for bothering you, but did you manage to solve your problem? We have a similar problem with Spark 1.5.2 using a JDBC connection with a DataFrame to an Oracle database. Thanks, Mark > On 12 Feb 2016, at 11:45, Mich Talebzadeh > wrote: >

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-11 Thread ayan guha
Hi, When you say "Zeppelin and STS", I am assuming you mean the "Spark Interpreter" and the "JDBC interpreter" respectively. Through Zeppelin, you can either run your own spark application (by using Zeppelin's own spark context) using the spark interpreter, OR you can access STS, which is a spark application