Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-12 Thread Chanh Le
Hi Ayan, Thank you for replying. But I want to create a table in Zeppelin and store the metadata in Alluxio, as I tried to do with set hive.metastore.warehouse.dir=alluxio://master1:19998/metadb, so I can share data with STS. The way you mentioned, through JDBC, I already did and it works, but I
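
A minimal sketch of this approach, assuming Zeppelin's Spark interpreter exposes a HiveContext as sqlContext (the table name here is only an illustration):

    // Point the Hive warehouse at Alluxio, then create a table so a
    // Spark Thrift Server sharing the same metastore can see it.
    sqlContext.sql("SET hive.metastore.warehouse.dir=alluxio://master1:19998/metadb")
    sqlContext.sql("CREATE TABLE IF NOT EXISTS shared_tbl (id INT, name STRING)")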

Spark Streaming: Refreshing broadcast value after each batch

2016-07-12 Thread Daniel Haviv
Hi, I have a streaming application which uses a broadcast variable that I populate from a database. Every once in a while (or even every batch) I would like to update/replace the broadcast variable with the latest data from the database. The only way I found online to do this is this "hackish" way (
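
A common shape of that "hackish" workaround, as a minimal Scala sketch (RefData and loadFromDatabase() are placeholder names, not from the thread): keep the Broadcast in a driver-side holder and rebuild it each batch.

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object RefData {
      @volatile private var bc: Broadcast[Map[String, String]] = _
      def refresh(sc: SparkContext): Unit = synchronized {
        if (bc != null) bc.unpersist(blocking = false) // drop stale executor copies
        bc = sc.broadcast(loadFromDatabase())          // re-broadcast fresh data
      }
      def get: Broadcast[Map[String, String]] = bc
      private def loadFromDatabase(): Map[String, String] = Map.empty // placeholder DB read
    }

    // In the streaming job, once per batch on the driver:
    // dstream.foreachRDD { rdd =>
    //   RefData.refresh(rdd.sparkContext)
    //   val ref = RefData.get
    //   rdd.foreach(x => ref.value.get(x.toString))
    // }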

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
The primary goal of balancing partitions would be for the write to S3. We would like to prevent unbalanced partitions (which we can do with repartition), but also avoid partitions that are too small or too large. So for that case, getting the cache size would work, Maropu, if it's roughly accurate, but

Re: Spark cache behaviour when the source table is modified

2016-07-12 Thread Chanh Le
Hi Anjali, The cache is immutable; you can't update data in it. The way to update a cache is to re-create it. > On Jun 16, 2016, at 4:24 PM, Anjali Chadha wrote: > > Hi all, > > I am having a hard time understanding the caching concepts in Spark. > > I have a hive
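
A minimal sketch of that re-creation, assuming a HiveContext hc and a cached table my_table (both names are illustrative):

    hc.uncacheTable("my_table")   // drop the stale in-memory copy
    hc.cacheTable("my_table")     // lazily re-cache the current table contents
    hc.table("my_table").count()  // optional action to materialize the cache now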

Inode for STS

2016-07-12 Thread ayan guha
Hi We are running the Spark Thrift Server as a long-running application. However, it looks like it is filling up the /tmp/hive folder with lots of small files and empty directories, blowing out the inode limit and preventing any connection with a "No Space Left on Device" error. What is the

Re: Spark hangs at "Removed broadcast_*"

2016-07-12 Thread Sea
please provide your jstack info. ------ Original Message ------ From: "dhruve ashar"; Date: 2016-07-13 3:53; To: "Anton Sviridov"; Cc: "user"; Subject: Re: Spark hangs at "Removed

Re: Large files with wholetextfile()

2016-07-12 Thread Hyukjin Kwon
Otherwise, please consider using https://github.com/databricks/spark-xml. Actually, there is a function to find the input file name: the input_file_name function,
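
A sketch combining the two suggestions, assuming Spark 1.6 with the spark-xml package on the classpath (the path and rowTag are illustrative):

    import org.apache.spark.sql.functions.input_file_name

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")                   // adjust to your XML schema
      .load("hdfs:///data/xml/")
      .withColumn("source_file", input_file_name()) // file name per row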

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Hatim Diab
Hi, Since the final size depends on data types and compression, I've had to first get a rough estimate of the data written to disk, then compute the number of partitions: partitions = int(ceil(size_data * conversion_ratio / block_size)). In my case the block size is 256 MB, the source is txt and the destination is snappy
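
The same heuristic as a Scala sketch; the numbers below are illustrative assumptions, measured per dataset in practice:

    val sizeDataBytes   = 100L * 1024 * 1024 * 1024  // rough size of the source data
    val conversionRatio = 0.3                        // observed txt -> snappy shrinkage
    val blockSizeBytes  = 256L * 1024 * 1024         // target ~256 MB per partition
    val numPartitions   = math.ceil(sizeDataBytes * conversionRatio / blockSizeBytes).toInt
    // df.repartition(numPartitions).write.parquet("s3a://bucket/path")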

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Takeshi Yamamuro
Hi, There is no simple way to access the size on the driver side. Since partitions of primitive-typed data (e.g., int) are compressed by `DataFrame#cache`, the actual size may differ a little from the size of the partitions being processed. // maropu On Wed, Jul 13, 2016 at 4:53 AM, Pedro
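
One rough driver-side probe, as a sketch (with the caveat above that the cached size only approximates the processing size):

    df.cache()
    df.count()  // action to materialize the cache
    sc.getRDDStorageInfo.foreach(info => println(s"${info.name}: ${info.memSize} bytes"))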

Re: Spark cluster tuning recommendation

2016-07-12 Thread Takeshi Yamamuro
Hi, Have you seen this slide deck from Spark Summit 2016? https://spark-summit.org/2016/events/top-5-mistakes-when-writing-spark-applications/ It is a good reference for your capacity planning. // maropu On Tue, Jul 12, 2016 at 2:31 PM, Yash Sharma wrote: > I would say use the dynamic

Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
Hi, Are there any tools for partitioning RDDs/DataFrames by size at runtime? The idea would be to specify that I would like each partition to be roughly X megabytes, then write that through to S3. I haven't found anything off the shelf, and looking through Stack Overflow posts doesn't

Re: bisecting kmeans model tree

2016-07-12 Thread roni
Hi Spark/MLlib experts, Can anyone shed some light on this? Thanks _R On Thu, Apr 21, 2016 at 12:46 PM, roni wrote: > Hi , > I want to get the bisecting k-means tree structure to show a dendrogram on > the heatmap I am generating based on the hierarchical clustering of

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-12 Thread Pasquinell Urbani
In the forum mentioned above, the following solution is suggested. The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be fixed by changing line 113 like so: before: val requiredSamples = math.max(numBins * numBins, 1) after: val requiredSamples = math.max(numBins * numBins,

Output Op Duration vs Job Duration: What's the difference?

2016-07-12 Thread Renxia Wang
Hi, I am using Spark 1.6.1 on EMR, running a streaming app on YARN. From the Spark UI I see that for each batch, the *Output Op Duration* is larger than the *Job Duration* (screenshot attached). What's the difference between the two? Does the *Job Duration* count only the executor time of each job,

Feature importance IN random forest

2016-07-12 Thread pseudo oduesp
Hi, I use PySpark 1.5.0. Can I ask how I can get feature importances for a random forest algorithm in PySpark? Please give me an example. Thanks in advance.

Re: ml and mllib persistence

2016-07-12 Thread Reynold Xin
Also Java serialization isn't great for cross platform compatibility. On Tuesday, July 12, 2016, aka.fe2s wrote: > Okay, I think I found an answer on my question. Some models (for instance > org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs, > so just

Re: ml and mllib persistence

2016-07-12 Thread aka.fe2s
Okay, I think I found an answer to my question. Some models (for instance org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs, so just serializing these objects will not work. -- Oleksiy Dyagilev On Tue, Jul 12, 2016 at 5:40 PM, aka.fe2s wrote: > What
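
This is exactly why the per-model Saveable API exists; a minimal sketch of the supported route (the path is illustrative, and model is a trained MatrixFactorizationModel):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    model.save(sc, "hdfs:///models/als")  // writes metadata plus the RDD-backed factors
    val restored = MatrixFactorizationModel.load(sc, "hdfs:///models/als")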

Re: RDD for loop vs foreach

2016-07-12 Thread Deepak Sharma
Hi Phil, I guess for() is executed on the driver while foreach() executes in parallel. You can try both without collecting the RDD. foreach in this case would print on the executors and you would not see anything on the driver console. Thanks Deepak On Tue, Jul 12, 2016 at 9:28 PM,

Re: location of a partition in the cluster/ how parallelize method distribute the RDD partitions over the cluster.

2016-07-12 Thread aka.fe2s
The local collection is distributed to the cluster only when you run an action (http://spark.apache.org/docs/latest/programming-guide.html#actions), due to the laziness of RDDs. If you want to control how the collection is split into partitions, you can create your own RDD implementation and implement
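
A small sketch of the distinction (numSlices controls the split; nothing moves until the action):

    val rdd = sc.parallelize(1 to 1000000, numSlices = 16) // split into 16 partitions
    rdd.count() // action: only here is the collection shipped to the cluster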

Re: Matrix Factorization Model model.save error "NullPointerException"

2016-07-12 Thread Zhou (Joe) Xing
Does anyone have an idea what this NPE issue below is about? Thank you! cheers zhou On Jul 11, 2016, at 11:27 PM, Zhou (Joe) Xing wrote: Hi Guys, I searched the archive and also googled this problem when saving the ALS-trained Matrix

RDD for loop vs foreach

2016-07-12 Thread philipghu
Hi, I'm new to Spark and to Scala as well. I understand that we can use foreach to apply a function to each element of an RDD, like rdd.foreach(x => println(x)), but I saw we can also do a for loop to print each element of an RDD, like for (x <- rdd) { println(x) }. Does defining the foreach
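
For reference, a for loop without yield is syntactic sugar for foreach, so on an RDD both forms run on the executors, and the println output goes to executor stdout rather than the driver console:

    rdd.foreach(x => println(x))
    for (x <- rdd) { println(x) }   // desugars to rdd.foreach(x => println(x))

    // To print on the driver, collect first (only safe for small RDDs):
    rdd.collect().foreach(println)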

ml and mllib persistence

2016-07-12 Thread aka.fe2s
What is the reason Spark has individual implementations of read/write routines for every model in mllib and ml (the Saveable and MLWritable trait implementations)? Wouldn't a generic implementation via the Java serialization mechanism work? I would like to use it to store the models in a custom storage. --

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
I guess that is what the DAG adds up to with Tez. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Send real-time alert using Spark

2016-07-12 Thread Priya Ch
I mean, model training on incoming data is taken care of by Spark. For detected anomalies, we need to send an alert. Could we do this with Spark itself, or would another framework like Akka or a REST API be needed? Thanks, Padma CH On Tue, Jul 12, 2016 at 7:30 PM, Marcin Tustin wrote: > Priya,

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Hi Ayan, This is a very valid question and I have not seen any available instrumentation in Spark that allows one to measure this in a practical way in a cluster. Classic example: 1. if you have a memory issue, do you upgrade your RAM or scale out horizontally by adding a couple more

Send real-time alert using Spark

2016-07-12 Thread Priya Ch
Hi All, I am building a real-time anomaly detection system where I am using k-means to detect anomalies. Now, in order to send an alert to a mobile device or by email, how do I send it using Spark itself? Thanks, Padma CH

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Sorry, I completely missed your points. I was NOT talking about Exadata. I was comparing Oracle 12c caching with that of Oracle TimesTen. No one mentioned Exadata here, nor storage indexes etc. So if Tez is not MR with DAG, could you give me an example of how it works? No opinions, but relevant to

Re: Large files with wholetextfile()

2016-07-12 Thread Prashant Sharma
Hi Baahu, That should not be a problem, provided you allocate a sufficient buffer for reading. I was just working on a patch [1] to support reading whole text files in SQL. That can actually be a slightly better approach, because there we read into off-heap memory for holding

Large files with wholetextfile()

2016-07-12 Thread Bahubali Jain
Hi, We have a requirement wherein we need to process a set of XML files; each of the XML files contains several records (eg: data of record 1.. data of record 2.. expected output is ..). Since we needed the file name as well in the output, we chose wholeTextFiles(). We had to go
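
A sketch of the wholeTextFiles() approach, which yields (fileName, content) pairs (the path and minPartitions are illustrative):

    val files = sc.wholeTextFiles("hdfs:///data/xml/", minPartitions = 100)
    val sizes = files.map { case (fileName, content) => (fileName, content.length) }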

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke
I think the comparison between the Oracle RDBMS and Oracle TimesTen is not so good. There are times when Oracle's in-memory database is slower than the RDBMS (especially in the case of Exadata), because in-memory (as in Spark) means everything is in memory and everything is always

Re: Handling categorical variables in StreamingLogisticRegressionwithSGD

2016-07-12 Thread Sean Owen
Yeah, for this to work, you need to know the number of distinct values a categorical feature will take on, ever. Sometimes that's known, sometimes it's not. One option is to use an algorithm that can use categorical features directly, like decision trees. You could consider hashing your features
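
A minimal sketch of the hashing idea, which keeps numFeatures fixed even when unseen category values appear later (numBuckets and the feature names are illustrative):

    import org.apache.spark.mllib.linalg.Vectors

    val numBuckets = 1 << 18

    def hashedIndex(feature: String, value: String): Int = {
      val h = (feature + "=" + value).hashCode % numBuckets
      if (h < 0) h + numBuckets else h
    }

    // One record's categorical features as a fixed-width sparse vector:
    val idx = Seq(hashedIndex("site", "example.com"), hashedIndex("adId", "1234"))
      .distinct.sorted.toArray
    val vec = Vectors.sparse(numBuckets, idx, Array.fill(idx.length)(1.0))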

Re: KEYS file?

2016-07-12 Thread Steve Loughran
On 11 Jul 2016, at 04:48, Shuai Lin wrote: > at least links to the keys used to sign releases on the download page +1 for that. Really, all release keys for ASF projects should be signed by others in the project and the broader ASF

Re: Error in Spark job

2016-07-12 Thread Yash Sharma
Looks like the write to Aerospike is taking too long. Could you try writing the RDD directly to the filesystem, skipping the Aerospike write? foreachPartition at WriteToAerospike.java:47, took 338.345827 s - Thanks, via mobile, excuse brevity. On Jul 12, 2016 8:08 PM, "Saurav Sinha"
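
A sketch of that isolation test, with an illustrative output path:

    rdd.saveAsTextFile("hdfs:///tmp/aerospike-debug")
    // If this is also slow, the bottleneck is upstream of the Aerospike client.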

Error in Spark job

2016-07-12 Thread Saurav Sinha
Hi, I am running into an issue where the job is running with many partitions, around 21000 parts. Settings: driver = 5G, executor memory = 10G, total executor cores = 32. It is failing when I try to write to Aerospike; earlier it was working fine. I suspect the number of partitions is the reason.

Re: How to Register Permanent User-Defined-Functions (UDFs) in SparkSQL

2016-07-12 Thread Daniel Darabos
Hi Lokesh, There is no way to do that. The SqlContext.newSession documentation says: "Returns a SQLContext as new session, with separated SQL configurations, temporary tables, registered functions, but sharing the same SparkContext, CacheManager, SQLListener and SQLTab." You have two options:
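
Since registered functions are per-session, one workaround is a helper that re-registers the UDFs on every new session; a sketch (the helper and UDF names are illustrative, and this may not match the truncated options above):

    import org.apache.spark.sql.SQLContext

    def registerUdfs(ctx: SQLContext): Unit = {
      ctx.udf.register("strLen", (s: String) => if (s == null) 0 else s.length)
    }

    val session = sqlContext.newSession()
    registerUdfs(session)
    session.sql("SELECT strLen('spark')").show()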

Handling categorical variables in StreamingLogisticRegressionwithSGD

2016-07-12 Thread kundan kumar
Hi, I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model. The document http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression mentions that numFeatures should be *constant*. The problem that I am facing is: since most

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
That is only a plan, not what the execution engine is doing. As I stated before, Spark uses DAG + in-memory computing; MR is serial, on disk. The key here is the execution, or rather the execution engine. In general, the standard MapReduce, as I know it, reads the data from HDFS, applies the map-reduce algorithm

Re: question about UDAF

2016-07-12 Thread luohui20001
Hi Pedro, thanks for your advice. I got my code working as below. Code in main:

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    val hiveTable = hc.sql("select lp_location_id,id,pv from house_id_pv_location_top50")
    val jsonArray = new JsonArray
    val middleResult =

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
This is the whole idea. Spark uses DAG + in-memory computing; MR is classic. This is for Hive on Spark:

    hive> explain select max(id) from dummy_parquet;
    OK
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
    STAGE PLANS:
      Stage: Stage-1
        Spark
          Edges:
            Reducer 2 <- Map

Re: Spark hangs at "Removed broadcast_*"

2016-07-12 Thread Anton Sviridov
Hi. Here are the last few lines before it starts removing broadcasts: 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_20160723_0009_m_003209_20886' to

Spark streaming graceful shutdown when running on yarn-cluster deploy-mode

2016-07-12 Thread Guy Harmach
Hi, I'm a newbie to Spark, starting to work with Spark 1.5 using the Java API (about to upgrade to 1.6 soon). I am deploying a Spark Streaming application using spark-submit in yarn-cluster mode. What is the recommended way to perform a graceful shutdown of the Spark job? Already tried
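
Two options commonly used on Spark 1.4+ (a sketch in Scala; the Java API is analogous):

    import org.apache.spark.SparkConf

    // 1) Let the JVM shutdown hook stop streaming gracefully, e.g. when YARN
    //    kills the application:
    val conf = new SparkConf().set("spark.streaming.stopGracefullyOnShutdown", "true")

    // 2) Stop explicitly when some external condition is observed:
    // ssc.stop(stopSparkContext = true, stopGracefully = true)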

Matrix Factorization Model model.save error "NullPointerException"

2016-07-12 Thread Zhou (Joe) Xing
Hi Guys, I searched the archive and also googled this problem when saving the ALS-trained Matrix Factorization Model to the local file system using the Model.save() method. I found some hints, such as partitioning the model before saving, etc., but they do not seem to solve my problem. I'm always