Re: Should it be safe to embed Spark in Local Mode?

2016-07-19 Thread Holden Karau
That's interesting and might be better suited to the dev list. I know that in some cases System.exit(-1) calls were added so the task would be marked as a failure. On Tuesday, July 19, 2016, Brett Randall wrote: > This question is regarding >

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Deepak Sharma
I am using a DAO in a Spark application to write the final computation to Cassandra and it performs well. What kinds of issues do you foresee using a DAO for HBase? Thanks Deepak On 19 Jul 2016 10:04 pm, "Yu Wei" wrote: > Hi guys, > > > I write spark application and want to store

Extremely slow shuffle writes and large job time fluctuations

2016-07-19 Thread Jon Chase
I'm running into an issue with a pyspark job where I'm sometimes seeing extremely variable job times (20min to 2hr) and very long shuffle times (e.g. ~2 minutes for 18KB/86 records). The cluster setup is Amazon EMR 4.4.0, Spark 1.6.0, an m4.2xl driver and a single m4.10xlarge (40 vCPU, 160GB)

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-19 Thread Deepak Sharma
In Spark Streaming, you have to decide the duration of the micro-batches to run. Once you get the micro-batch, transform it as per your logic and then you can use saveAsTextFiles on your final DStream to write it to HDFS. Thanks Deepak On 20 Jul 2016 9:49 am, wrote:
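
For illustration, a minimal Scala sketch of the pattern described above. The socket source, batch interval, and output path are made up for the example; a Kafka direct stream would slot in the same way:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamToHdfs")
    // Micro-batch duration: a new batch RDD is produced every 30 seconds
    val ssc = new StreamingContext(conf, Seconds(30))

    // Hypothetical source; a Kafka direct stream would be used the same way
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transform each micro-batch, then write it out.
    // saveAsTextFiles creates one directory per batch: <prefix>-<timestamp>.<suffix>
    lines.filter(_.nonEmpty)
      .saveAsTextFiles("hdfs:///data/stream/out", "txt")

    ssc.start()
    ssc.awaitTermination()
  }
}
```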

Storm HDFS bolt equivalent in Spark Streaming.

2016-07-19 Thread Rajesh_Kalluri
Dell - Internal Use - Confidential While writing to HDFS from Storm, the hdfs bolt provides a nice way to batch the messages, rotate files, control the file naming convention, etc. as shown below. Do you know of something similar in Spark Streaming or do we have to roll our

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Ted Yu
The hbase-spark module is in the upcoming HBase 2.0 release. Currently it is in the master branch of the HBase git repo. FYI On Tue, Jul 19, 2016 at 8:27 PM, Andrew Ehrlich wrote: > There is a Spark<->HBase library that does this. I used it once in a > prototype (never tried in

Re: Spark Job trigger in production

2016-07-19 Thread Andrew Ehrlich
Another option is Oozie with the spark action: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > > You can use following options > > *

Re: the spark job is so slow - almost frozen

2016-07-19 Thread Andrew Ehrlich
Try: - filtering down the data as soon as possible in the job, dropping columns you don’t need. - processing fewer partitions of the hive tables at a time - caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed - using the
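
For illustration, a minimal Scala sketch of the first few suggestions, with made-up table and column names (Spark 1.x HiveContext style):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Project only the needed columns and filter as early as possible,
// so less data flows into later shuffles
val facts = hiveContext.table("sales_facts")
  .select("customer_id", "amount", "event_date")
  .filter("event_date >= '2016-07-01'")

// Cache a small, frequently accessed dimension/lookup table before joining
val customers = hiveContext.table("dim_customers")
  .select("customer_id", "region")
  .cache()

val joined = facts.join(customers, "customer_id")
```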

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Andrew Ehrlich
There is a Spark<->HBase library that does this. I used it once in a prototype (never tried it in production, though): http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

Re: Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Ajinkya Kale
I am using Google Cloud Dataproc which comes with Spark 1.6.1, so an upgrade is not really an option. Is there no way/hack to save the models in Spark 1.6.1? On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin wrote: > It's added in not-released-yet 2.0.0 version. > >

Re: Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Andrew Ehrlich
Yea this is a good suggestion; also check the 25th percentile, median, and 75th percentile to see how skewed the input data is. If you find that the RDD's partitions are skewed you can solve it either by changing the partitioner when you read the files, as already suggested, or by calling repartition()
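
For illustration, a minimal Scala sketch of the two options just mentioned; the input path, key extraction, and partition counts are made up:

```scala
import org.apache.spark.HashPartitioner

val skewedRdd = sc.textFile("hdfs:///data/input")

// Option 1: full shuffle into a larger number of roughly equal partitions
val rebalanced = skewedRdd.repartition(200)

// Option 2: for key/value data, choose the partitioner explicitly
val byKey = skewedRdd.map(line => (line.split(",")(0), line))
val hashed = byKey.partitionBy(new HashPartitioner(200))
```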

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-19 Thread Tobi Bosede
Thanks Yanbo, will try that! On Sun, Jul 17, 2016 at 10:26 PM, Yanbo Liang wrote: > Hi Tobi, > > Thanks for clarifying the question. It's very straightforward to convert > the filtered RDD to a DataFrame, you can refer to the following code snippets: > > from pyspark.sql import

Re: Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Shuai Lin
It's added in not-released-yet 2.0.0 version. https://issues.apache.org/jira/browse/SPARK-13036 https://github.com/apache/spark/commit/83302c3b So I guess you need to wait for the 2.0 release (or use the current RC4). On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale wrote: >

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-19 Thread Andrew Ehrlich
Troubleshooting steps: $ telnet localhost 7077 (on master, to confirm the port is open) $ telnet <master-host> 7077 (on slave, to confirm whether the port is blocked) If the port is available on the master from the master, but not on the master from the slave, check firewall settings on the master:

Re: Dataframe Transformation with Inner fields in Complex Datatypes.

2016-07-19 Thread java bigdata
Hi Ayan, Thanks for your update. All I am trying to do is update an inner field in one of the dataframe's complex-type columns. The withColumn method adds or replaces an existing column; in my case the column is a nested column. Please see the example I mentioned below in the mail. I don't have to add a new

Should it be safe to embed Spark in Local Mode?

2016-07-19 Thread Brett Randall
This question is regarding https://issues.apache.org/jira/browse/SPARK-15685 (StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode) and hopes to draw attention-to and discussion-on that issue. I have a product that is hosted as a

spark worker continuously trying to connect to master and failed in standalone mode

2016-07-19 Thread Neil Chang
Hi, I have two virtual PCs on a private cloud (Ubuntu 14). I installed the Spark 2.0 preview on both machines. I then tried to test it in standalone mode. I have no problem starting the master. However, when I start the worker (slave) on the other machine, it makes many attempts to connect to the master and

HiveContext: difficulties in accessing tables in Hive schemas/databases other than the default database.

2016-07-19 Thread satyajit vegesna
Hi All, I have been trying to access tables from schemas other than default, to pull data into a dataframe. I was successful in doing it using the default schema in the Hive database, but when I try any other schema/database in Hive, I get the below error. (Have also not seen any examples

Unsubscribe

2016-07-19 Thread sjk
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Kuchekar
Hi, Can you check if the RDD is partitioned correctly, with the correct partition number (if you are manually setting the partition value)? Try using a hash partitioner while reading the files. One way you can debug is by checking the number of records that one executor has compared to the others in the
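
As an illustration of that debugging step, a rough Scala sketch that counts records per partition (the input path is made up):

```scala
// Count the records in every partition and print the largest ones,
// to see whether a few partitions hold most of the data
val partitionSizes = sc.textFile("hdfs:///data/input")
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
  .sortBy(-_._2)

partitionSizes.take(10).foreach { case (idx, n) =>
  println(s"partition $idx -> $n records")
}
```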

Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Aaron Jackson
Hi, I have a cluster with 15 nodes of which 5 are HDFS nodes. I kick off a job that creates some 120 stages. Eventually, the active and pending stages reduce down to a small bottleneck and it never fails... the tasks associated with the 10 (or so) running tasks are always allocated to the same

Re: Missing Executor Logs From Yarn After Spark Failure

2016-07-19 Thread ayan guha
If YARN log aggregation is enabled then the logs will be moved to HDFS. You can use yarn logs -applicationId <application ID> to view those logs. On Wed, Jul 20, 2016 at 8:58 AM, Ted Yu wrote: > What's the value for yarn.log-aggregation.retain-seconds > and yarn.log-aggregation-enable ? > >

Re: Little idea needed

2016-07-19 Thread ayan guha
Well, this one keeps cropping up in every project, especially when Hadoop is implemented alongside an MPP. In fact, there is no reliable out-of-the-box update operation available in HDFS or Hive or Spark. Hence, one approach is what Mitch suggested: do not update. Rather, just keep all source

Re: Missing Executor Logs From Yarn After Spark Failure

2016-07-19 Thread Ted Yu
What's the value for yarn.log-aggregation.retain-seconds and yarn.log-aggregation-enable ? Which hadoop release are you using ? Thanks On Tue, Jul 19, 2016 at 3:23 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > I am trying to find the root cause of recent Spark

Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Ajinkya Kale
Is there a way to save a pyspark.ml.feature.PCA model? I know mllib has that, but mllib does not have PCA afaik. How do people do model persistence for inference using the pyspark ml models? Did not find any documentation on model persistence for ml. --ajinkya

Missing Executor Logs From Yarn After Spark Failure

2016-07-19 Thread Rachana Srivastava
I am trying to find the root cause of a recent Spark application failure in production. When the Spark application is running I can check the NodeManager's yarn.nodemanager.log-dir property to get the Spark executor container logs. The container has logs for both the running Spark applications. Here

Re: Role-based S3 access outside of EMR

2016-07-19 Thread Andy Davidson
Hi Everett I always do my initial data exploration and all our product development in my local dev env. I typically select a small data set and copy it to my local machine. My main() has an optional command line argument '--runLocal'. Normally I load data from either hdfs:/// or S3n://. If the

Re: Task not serializable: java.io.NotSerializableException: org.json4s.Serialization$$anon$1

2016-07-19 Thread RK Aduri
Did you check this: case class Example(name : String, age ; Int) has a semicolon where there should be a colon; it should have been (age : Int). -- View this message in context:

Role-based S3 access outside of EMR

2016-07-19 Thread Everett Anderson
Hi, When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop FileSystem implementation for s3:// URLs and seems to install the necessary S3 credentials properties, as well. Often, it's nice during development to run outside of a cluster even with the "local" Spark master, though,

Re: spark-submit local and Akka startup timeouts

2016-07-19 Thread Bryan Cutler
The patch I was referring to doesn't help on the ActorSystem startup unfortunately. As best I can tell the property "akka.remote.startup-timeout" is what controls this timeout. You can try setting this to something greater in your Spark conf and hopefully that would work. Otherwise you might

Re: Task not serializable: java.io.NotSerializableException: org.json4s.Serialization$$anon$1

2016-07-19 Thread joshuata
It looks like the problem is that the parse function is non-serializable. This is most likely because the formats variable is local to the ParseJson object, and therefore not globally accessible to the cluster. Generally this problem can be solved by moving the variable inside the closure so that it
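
For illustration, a hedged Scala sketch of that workaround, with a made-up case class and input path; the point is that the json4s Formats instance is created inside the closure, on the executors, instead of being captured from the driver:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

case class Event(name: String, age: Int)   // illustrative schema

val jsonLines = sc.textFile("hdfs:///data/events.json")

val parsed = jsonLines.mapPartitions { iter =>
  // The Formats instance is built inside the closure, on the executor,
  // so nothing non-serializable has to be shipped from the driver
  implicit val formats: Formats = DefaultFormats
  iter.map(line => parse(line).extract[Event])
}
```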

Re: Little idea needed

2016-07-19 Thread Mich Talebzadeh
Well this is a classic. The initial load can be done through Sqoop (outside of Spark) or through a JDBC connection in Spark. 10 million rows is nothing. Then you have to think of updates and deletes in addition to new rows. With Sqoop you can load from the last ID in the source table, assuming

Re: Little idea needed

2016-07-19 Thread Jörn Franke
Well, as far as I know there is an update statement planned for Spark, but I am not sure for which release. You could alternatively use Hive+ORC. Another alternative would be to add the deltas in a separate file and, when accessing the table, filter out the double entries. From time to time you could

Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Jacek Laskowski
Hi, hadoop-2.7 would be more up to date. You don't need hadoop.version when the defaults are fine; it is 2.7.2 for the hadoop-2.7 profile. Jacek On 19 Jul 2016 6:09 p.m., "Jakob Odersky" wrote: > Hi Eli, > > to build spark, just run > > build/mvn -Pyarn -Phadoop-2.6

Little idea needed

2016-07-19 Thread Aakash Basu
Hi all, I'm trying to pull a full table from oracle, which is huge with some 10 million records which will be the initial load to HDFS. Then I will do delta loads everyday in the same folder in HDFS. Now, my query here is, DAY 0 - I did the initial load (full dump). DAY 1 - I'll load only

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Ashok Kumar
Thanks Mich looking forward to it :) On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh wrote: Hi all, This will be in London tomorrow Wednesday 20th July starting at 18:00 hour for refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf Station,

Re: Execute function once on each node

2016-07-19 Thread Rabin Banerjee
" I am working on a spark application that requires the ability to run a function on each node in the cluster " -- Use Apache Ignite instead of Spark. Trust me it's awesome for this use case. Regards, Rabin Banerjee On Jul 19, 2016 3:27 AM, "joshuata" wrote: > I am

Strange behavior including memory leak and NPE

2016-07-19 Thread rachmaninovquartet
Hi, I've been fighting with a strange situation today. I'm trying to add two entries for each of the distinct rows of an account, except for the first and last (by date). Here's an example of some of the code. I can't get the subset to continue forward: var acctIdList =

Re: how to setup the development environment of spark with IntelliJ on ubuntu

2016-07-19 Thread joshuata
I have found the easiest way to set up a development platform is to use the databricks sbt-spark-package plugin (assuming you are using scala+sbt). You simply add the plugin to your /project/plugins.sbt file and add the sparkVersion to your

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Diwakar Dhanuskodi
Okay, got it regarding the parallelism that you are describing. Yes, we use a dataframe to infer the schema and process the data. The JSON schema has XML data as one of the key-value pairs. The XML data needs to be processed in foreachRDD. The JSON schema doesn't change often. Regards, Diwakar Sent from Samsung

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Diwakar Dhanuskodi
Okay, got that about the parallelism that you are describing. Yes, we use a dataframe to infer the schema and process the data. The JSON schema has XML data as one of the key-value pairs. The XML data needs to be processed in foreachRDD. The JSON schema doesn't change often. Regards, Diwakar. Sent from Samsung

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Mich Talebzadeh
Hi all, This will be in London tomorrow Wednesday 20th July starting at 18:00 hour for refreshments and kick off at 18:30, 5 minutes walk from Canary Wharf Station, Jubilee Line If you wish you can register and get more info here It will be in La

Re: Spark 7736

2016-07-19 Thread Holden Karau
Indeed there is: sign up for an Apache JIRA account, then when you visit the JIRA page logged in you should see a "reopen issue" button. For issues like this (reopening a JIRA) you might find the dev list to be more useful. On Wed, Jul 13, 2016 at 4:47 AM, ayan guha

Re: spark single PROCESS_LOCAL task

2016-07-19 Thread Holden Karau
So it's possible that you have a lot of data in one of the partitions which is local to that process; maybe you could cache & count the upstream RDD and see what the input partitions look like? On the other hand, using groupByKey is often a bad sign to begin with - can you rewrite your code to
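
For illustration, a minimal Scala sketch of the groupByKey-versus-reduceByKey point, with made-up data:

```scala
val events = sc.parallelize(Seq(("user1", 1L), ("user2", 3L), ("user1", 2L)))

// groupByKey ships every value for a key across the network before summing
val grouped = events.groupByKey().mapValues(_.sum)

// reduceByKey combines values on the map side first, shuffling far less data
val reduced = events.reduceByKey(_ + _)
```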

Re: which one spark ml or spark mllib

2016-07-19 Thread Holden Karau
So Spark ML is going to be the actively developed machine learning library going forward; however, back in Spark 1.5 it was still relatively new and an experimental component, so not all of the save/load support was implemented for the same models. That being said, for 2.0 ML doesn't have PMML export

Re: transition SQLContext to SparkSession

2016-07-19 Thread Reynold Xin
Yes. But in order to access methods available only in HiveContext a user cast is required. On Tuesday, July 19, 2016, Maciej Bryński wrote: > @Reynold Xin, > How this will work with Hive Support ? > SparkSession.sqlContext return HiveContext ? > > 2016-07-19 0:26 GMT+02:00

Re: transition SQLContext to SparkSession

2016-07-19 Thread Maciej Bryński
@Reynold Xin, How will this work with Hive support? Will SparkSession.sqlContext return a HiveContext? 2016-07-19 0:26 GMT+02:00 Reynold Xin : > Good idea. > > https://github.com/apache/spark/pull/14252 > > > > On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust

Re: Spark driver getting out of memory

2016-07-19 Thread RK Aduri
Just want to see if this helps. Are you doing heavy collects and persisting the result? If so, you might want to parallelize that collection by converting it back to an RDD. Thanks, RK On Tue, Jul 19, 2016 at 12:09 AM, Saurav Sinha wrote: > Hi Mich, > >1. In what mode are
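
For illustration, a rough Scala sketch of that suggestion, with made-up names; the per-record work is a stand-in for whatever the driver was doing with the collected data:

```scala
val bigRdd = sc.textFile("hdfs:///data/big")

// Stand-in for whatever per-record work the driver was doing
def expensiveStep(x: String): String = x.toUpperCase

// Collecting pulls everything into driver memory
val collected: Array[String] = bigRdd.collect()

// Instead of looping over `collected` on the driver, push it back out
// as an RDD so the executors do the work
val redistributed = sc.parallelize(collected.toSeq, numSlices = 100)
val processed = redistributed.map(expensiveStep)
```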

Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Yu Wei
Hi guys, I write a Spark application and want to store the results generated by the Spark application in HBase. Do I need to access HBase via the Java API directly? Or is it a better choice to use a DAO, similar to a traditional RDBMS? I suspect that there is a major performance downgrade and other negative

Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
Technical limitations keep us from running another filesystem on the SSDs. We are running on a very large HPC cluster without control over low-level system components. We have tried setting up an ad-hoc HDFS cluster on the nodes in our allocation, but we have had very little luck. It ends up being

Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
Thank you for that advice. I have tried similar techniques, but not that one. On Mon, Jul 18, 2016 at 11:42 PM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Thanks for the explanation. Try creating a custom RDD whose getPartitions > returns an array of custom partition objects of size

Re: Error in Word Count Program

2016-07-19 Thread Jakob Odersky
Does the file /home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md exist? On Tue, Jul 19, 2016 at 4:30 AM, RK Spark wrote: > val textFile = sc.textFile("README.md")val linesWithSpark = > textFile.filter(line => line.contains("Spark")) >

Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Jakob Odersky
Hi Eli, to build spark, just run build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests package in your source directory, where package is the actual word "package". This will recompile the whole project, so it may take a while when running the first time. Replacing a single file

Re: Is it possible to launch a spark streaming application on yarn with only one machine?

2016-07-19 Thread Yu Wei
Thanks very much for your help. I finally understood the deploy mode with your explanation, after trying different approaches in my development environment. Thanks again. From: Yu Wei Sent: Saturday, July 9, 2016 3:04:40 PM To: Rabin Banerjee

Re: Latest 200 messages per topic

2016-07-19 Thread Cody Koeninger
Unless you're using only 1 partition per topic, there's no reasonable way of doing this. Offsets for one topicpartition do not necessarily have anything to do with offsets for another topicpartition. You could do the last (200 / number of partitions) messages per topicpartition, but you have no

Re: Spark streaming takes longer time to read json into dataframes

2016-07-19 Thread Cody Koeninger
Yes, if you need more parallelism, you need to either add more kafka partitions or shuffle in spark. Do you actually need the dataframe api, or are you just using it as a way to infer the json schema? Inferring the schema is going to require reading through the RDD once before doing any other
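
For illustration, a hedged Scala sketch of reading each micro-batch with an explicit schema instead of inferring it; the field names and the DStream argument are assumptions, not taken from the thread:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.streaming.dstream.DStream

// Declaring the schema up front avoids the extra pass over every micro-batch
// that schema inference would otherwise need
val jsonSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload_xml", StringType),   // the embedded XML kept as a plain string
  StructField("ts", LongType)
))

def processJsonStream(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    val df = sqlContext.read.schema(jsonSchema).json(rdd)
    // ... process the batch, including the XML column, here
  }
}
```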

Re: Building standalone spark application via sbt

2016-07-19 Thread Andrew Ehrlich
Yes, spark-core will depend on Hadoop and several other jars. Here’s the list of dependencies: https://github.com/apache/spark/blob/master/core/pom.xml#L35 Whether you need spark-sql depends on whether you will use the DataFrame

Building standalone spark application via sbt

2016-07-19 Thread Sachin Mittal
Hi, Can someone please guide me what all jars I need to place in my lib folder of the project to build a standalone scala application via sbt. Note I need to provide static dependencies and I cannot download the jars using libraryDependencies. So I need to provide all the jars upfront. So far I

Re: Execute function once on each node

2016-07-19 Thread Koert Kuipers
The whole point of a well designed global filesystem is to not move the data On Jul 19, 2016 10:07, "Koert Kuipers" wrote: > If you run hdfs on those ssds (with low replication factor) wouldn't it > also effectively write to local disk with low latency? > > On Jul 18, 2016

Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Ted Yu
org.apache.spark.mllib.fpm is not a maven goal. -pl is for building individual projects. Your first build action should not include -pl. On Tue, Jul 19, 2016 at 4:22 AM, Eli Super wrote: > Hi > > I have a windows laptop > > I just downloaded the spark 1.4.1 source code. > > I try

Re: Spark ResourceLeak??

2016-07-19 Thread Ted Yu
ResourceLeakDetector doesn't seem to be from Spark. Please check dependencies for potential leak. Cheers On Tue, Jul 19, 2016 at 6:11 AM, Guruji wrote: > I am running a Spark Cluster on Mesos. The module reads data from Kafka as > DirectStream and pushes it into

Spark ResourceLeak?

2016-07-19 Thread saurabh guru
I am running a Spark Cluster on Mesos. The module reads data from Kafka as DirectStream and pushes it into elasticsearch after referring to a redis for getting Names against IDs. I have been getting this message in my worker logs. *16/07/19 11:17:44 ERROR ResourceLeakDetector: LEAK: You are

Spark ResourceLeak??

2016-07-19 Thread Guruji
I am running a Spark Cluster on Mesos. The module reads data from Kafka as DirectStream and pushes it into elasticsearch after referring to a redis for getting Names against IDs. I have been getting this message in my worker logs. *16/07/19 11:17:44 ERROR ResourceLeakDetector: LEAK: You are

Error in Word Count Program

2016-07-19 Thread RK Spark
val textFile = sc.textFile("README.md"); val linesWithSpark = textFile.filter(line => line.contains("Spark")); linesWithSpark.saveAsTextFile("output1") Same error: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md
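
For illustration, a minimal fix sketch. It assumes (from the path in the error) that spark-shell was started inside the bin/ directory while README.md actually sits one level up in the Spark home; pointing at an explicit path, or starting the shell from the Spark home, avoids the failing relative lookup:

```scala
// Assumes the Spark layout implied by the error message above
val textFile = sc.textFile("file:///home/user/spark-1.5.1-bin-hadoop2.4/README.md")
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.saveAsTextFile("output1")
```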

I'm trying to understand how to compile Spark

2016-07-19 Thread Eli Super
Hi I have a windows laptop I just downloaded the spark 1.4.1 source code. I try to compile *org.apache.spark.mllib.fpm* with *mvn * My goal is to replace *original *org\apache\spark\mllib\fpm\* in *spark-assembly-1.4.1-hadoop2.6.0.jar* As I understand from this link

which one spark ml or spark mllib

2016-07-19 Thread pseudo oduesp
Hi, I don't have any idea why we have two libraries, ML and MLlib. ML you can use with dataframes and MLlib with RDDs, but ML has some gaps, like model save, which is most important if you want to create a web API that serves scores. My question is: why don't we have all of MLlib's features in ML? (I use pyspark 1.5.0

Re: spark-submit local and Akka startup timeouts

2016-07-19 Thread Rory Waite
Sorry Bryan, I should have mentioned that I'm running 1.6.0 for hadoop2.6. The binaries were downloaded from the Spark website. We're free to upgrade Spark, create custom builds, etc. Please let me know how to display the config property.

Scala code as "spark view"

2016-07-19 Thread wdaehn
Using Spark via the Thrift server is fine and good, but it limits you to simple SQL queries. For all complex Spark logic you have to submit a job first, write the result into a table and then query the table. This obviously has the limitation that a) the user executing the query cannot pass

Re: Dynamically get value based on Map key in Spark Dataframe

2016-07-19 Thread Jacek Laskowski
Hi Divya, That's the right way to access a value for a key in a broadcast map. I'm pretty sure, though, that you could do the same (or similar) with higher-level broadcast Datasets. Try it out! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark
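
For illustration, a minimal Scala sketch of the broadcast-map lookup pattern being discussed (values are made up):

```scala
val lookup: Map[String, String] = Map("IN" -> "India", "PL" -> "Poland")
val bcLookup = sc.broadcast(lookup)

val codes = sc.parallelize(Seq("IN", "PL", "US"))
val resolved = codes.map { code =>
  // Each executor reads the broadcast value locally; getOrElse guards missing keys
  bcLookup.value.getOrElse(code, "unknown")
}
resolved.collect().foreach(println)
```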

Re: Spark driver getting out of memory

2016-07-19 Thread Saurav Sinha
Hi Mich, 1. In what mode are you running: Spark standalone, yarn-client, yarn-cluster, etc.? Ans: Spark standalone. 2. You have 4 nodes with each executor having 10G. How many actual executors do you see in the UI (port 4040 by default)? Ans: There are 4 executors on which I am using 8

Re: Execute function once on each node

2016-07-19 Thread Aniket Bhatnagar
Thanks for the explanation. Try creating a custom RDD whose getPartitions returns an array of custom partition objects of size n (= number of nodes). In a custom partition object, you can have the file path and ip/hostname where the partition needs to be computed. Then, have getPreferredLocations
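
For illustration, a hedged Scala sketch of the custom-RDD idea just described; the partition class, hostnames, file paths, and the body of compute() are all illustrative assumptions:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per node, each carrying the host it should run on
case class NodePartition(index: Int, host: String, path: String) extends Partition

class PerNodeRDD(sc: SparkContext, targets: Seq[(String, String)])
  extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    targets.zipWithIndex.map { case ((host, path), i) =>
      NodePartition(i, host, path): Partition
    }.toArray

  // Hint to the scheduler that each partition prefers its designated node
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[NodePartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[NodePartition]
    // e.g. read a node-local file here
    scala.io.Source.fromFile(p.path).getLines()
  }
}
```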