Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread Richard Siebeling
Hi David, the use case is that we're building a data processing system with an intuitive user interface where Spark is used as the data processing framework. We would like to provide an HTML user interface to R where the user types or copy-pastes his R code; the system should then send this R code

RE: spark job failure - akka error Association with remote system has failed

2016-01-13 Thread vivek.meghanathan
Identified the problem - the Cassandra seed IP we use was down! From: Vivek Meghanathan (WT01 - NEP) Sent: 13 January 2016 13:06 To: 'user@spark.apache.org' Subject: RE: spark job failure - akka error Association with remote system has failed I have used master_ip as ip

Re: Concurrent Read of Accumulator's Value

2016-01-13 Thread Ted Yu
One option is to use a NoSQL data store, such as hbase, for the two actions to exchange status information. Write to data store in action 1 and read from action 2. Cheers On Wed, Jan 13, 2016 at 2:20 AM, Kira wrote: > Hi, > > So i have an action on one RDD that is

Merging compatible schemas on Spark 1.6.0

2016-01-13 Thread emlyn
I have a series of directories on S3 with parquet data, all with compatible (but not identical) schemas. We verify that the schemas stay compatible when they evolve using org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility. On Spark 1.5, I could read these into a DataFrame with

Kafka Streaming and partitioning

2016-01-13 Thread ddav
Hi, I have the following use case: 1. Reference data stored in an RDD that is persisted and partitioned using a simple custom partitioner. 2. Input stream from kafka that uses the same partitioner algorithm as the ref data RDD - this partitioning is done in kafka. I am using kafka direct

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-13 Thread Tathagata Das
Can you just simplify the code and run a few counts to see if the cache is being used (later jobs are faster or not). In addition, use the Spark UI to see whether it is cached, and check the DAG viz of the job to see whether it is using the cached RDD or not (the DAG will show a green vertex if the RDD is

Concurrent Read of Accumulator's Value

2016-01-13 Thread Kira
Hi, So I have an action on one RDD that is relatively long, let's call it ac1; what I want to do is to execute another action (ac2) on the same RDD to see the evolution of the first one (ac1); to this end I want to use an accumulator and read its value progressively to see the changes on it (on

Re: FPGrowth does not handle large result sets

2016-01-13 Thread Sean Owen
As I said in your JIRA, the collect() in question is bringing results back to the driver to return them. The assumption is that there isn't a vast number of frequent items. If there is, they aren't really 'frequent' and your min support is too low. On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari

Spark ignores SPARK_WORKER_MEMORY?

2016-01-13 Thread Barak Yaish
Hello, Although I'm setting SPARK_WORKER_MEMORY in spark-env.sh, it looks like this setting is ignored. I can't find any indication in the scripts under bin/sbin that -Xms/-Xmx are set. If I ps the worker pid, it looks like the memory is set to 1G: [hadoop@sl-env1-hadoop1 spark-1.5.2-bin-hadoop2.6]$ ps

Co-Partitioned Joins

2016-01-13 Thread ddav
Hi, I am quite new to Spark and have some questions on joins and co-partitioning. Are the following assumptions correct? When a join takes place and one of the RDDs has been partitioned, does Spark make a best effort to execute the join for a specific partition where the partitioned data

Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-13 Thread angela.whelan
hi, I'm wondering if it is possible to use the SparkSQL JDBC ThriftServer without Hive? The reason I'm asking is that we are unsure about the speed of Hive with SparkSQL JDBC connectivity. I can't find any article online about using SparkSQL JDBC ThriftServer without Hive. Many thanks in

Read Accumulator value while running

2016-01-13 Thread Kira
Hi, So I have an action on one RDD that is relatively long, let's call it ac1; what I want to do is to execute another action (ac2) on the same RDD to see the evolution of the first one (ac1); to this end I want to use an accumulator and read its value progressively to see the changes on it

Spark Cassandra Java Connector: records missing despite consistency=ALL

2016-01-13 Thread Dennis Birkholz
Hi all, we use Cassandra to log event data and process it every 15 minutes with Spark. We are using the Cassandra Java Connector for Spark. Randomly our Spark runs produce too few output records because no data is returned from Cassandra for a several-minute window of input data. When

Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread David Russell
Hi Richard, Thanks for providing the background on your application. > the user types or copy-pastes his R code, > the system should then send this R code (using ROSE) to R Unfortunately this type of ad hoc R analysis is not supported. ROSE supports the execution of any R function or script

Error in Spark Executors when trying to read HBase table from Spark with Kerberos enabled

2016-01-13 Thread Vinay Kashyap
Hi all, I am using *Spark 1.5.1 in YARN cluster mode in CDH 5.5.* I am trying to create an RDD by reading HBase table with kerberos enabled. I am able to launch the spark job to read the HBase table, but I notice that the executors launched for the job cannot proceed due to an issue with

Error connecting to temporary derby metastore used by Spark, when running multiple jobs on the same SparkContext

2016-01-13 Thread Deenar Toraskar
Hi, I am using the spark-jobserver and see the following messages when a lot of jobs are submitted simultaneously to the same SparkContext. Any ideas as to what might cause this? [2016-01-13 13:12:11,753] ERROR com.jolbox.bonecp.BoneCP [] [akka://JobServer/user/context-supervisor/ingest-context]

Re: How to optimize and make this code faster using coalesce(1) and mapPartitionIndex

2016-01-13 Thread unk1102
Hi, thanks for the reply. Actually I can't share details as it is classified and pretty complex to understand; it is not a general problem I am trying to solve, it is related to dynamic database SQL order execution. I need to use Spark as my other jobs, which don't use coalesce, use Spark. My source data is

Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Devi P.V
I want to multiply two large matrices (from csv files) using Spark and Scala and save the output. I use the following code: val rows=file1.coalesce(1,false).map(x=>{ val line=x.split(delimiter).map(_.toDouble) Vectors.sparse(line.length, line.zipWithIndex.map(e => (e._2,

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Kevin Mellott
Lin - if you add "--verbose" to your original *spark-submit* command, it will let you know the location in which Spark is running. As Marcelo pointed out, this will likely indicate version 1.3, which may help you confirm if this is your problem. On Wed, Jan 13, 2016 at 12:06 PM, Marcelo Vanzin

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
The idea here is that the custom partitioner shouldn't actually get used for repartitioning the kafka stream (because that would involve a shuffle, which is what you're trying to avoid). You're just assigning a partitioner because you know how it already is partitioned. On Wed, Jan 13, 2016 at

Re: automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Andrew Or
Hi Alex, Yes, you can set `spark.cleaner.ttl`: http://spark.apache.org/docs/1.6.0/configuration.html, but I would not recommend it! We are actually removing this property in Spark 2.0 because it has caused problems for many users in the past. In particular, if you accidentally use a variable
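
A minimal sketch of the two options being contrasted here (the RDD and values are made up for illustration):

    // Option 1: time-based cleanup (the property removed in Spark 2.0, per the note above).
    val conf = new org.apache.spark.SparkConf().set("spark.cleaner.ttl", "86400") // seconds

    // Option 2 (recommended): unpersist explicitly once the cached RDD is no longer needed.
    val cached = someRdd.cache()
    // ... use `cached` in jobs ...
    cached.unpersist(blocking = false)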

Re: SQL UDF problem (with re to types)

2016-01-13 Thread Ted Yu
Looks like BigDecimal was passed to your call() method. Can you modify your UDF to see if using BigDecimal works? Cheers On Wed, Jan 13, 2016 at 11:58 AM, raghukiran wrote: > While registering and using SQL UDFs, I am running into the following > problem: > > UDF
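
As a rough illustration of that suggestion (Scala here, and the function body is made up; the original post uses a Java UDF1), a UDF over a DECIMAL column receives java.math.BigDecimal rather than Double:

    // Hypothetical sketch: declare the input as java.math.BigDecimal.
    sqlContext.udf.register("Test", (v: java.math.BigDecimal) => v.doubleValue() * 2)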

Re: Read Accumulator value while running

2016-01-13 Thread Andrew Or
Hi Kira, As you suspected, accumulator values are only updated after the task completes. We do send accumulator updates from the executors to the driver on periodic heartbeats, but these only concern internal accumulators, not the ones created by the user. In short, I'm afraid there is not

Re: Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Burak Yavuz
BlockMatrix.multiply is the suggested method of multiplying two large matrices. Is there a reason that you didn't use BlockMatrices? You can load the matrices and convert to and from RowMatrix. If it's in sparse format (i, j, v), then you can also use the CoordinateMatrix to load, BlockMatrix to
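
A rough sketch of that approach (paths, delimiter and block sizes are placeholders), assuming the input is already in (i, j, value) coordinate form:

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    def loadCoordinate(path: String) = new CoordinateMatrix(
      sc.textFile(path).map { line =>
        val Array(i, j, v) = line.split(",")
        MatrixEntry(i.toLong, j.toLong, v.toDouble)
      })

    val a = loadCoordinate("hdfs:///data/A.csv").toBlockMatrix(1024, 1024).cache()
    val b = loadCoordinate("hdfs:///data/B.csv").toBlockMatrix(1024, 1024).cache()
    val product = a.multiply(b) // BlockMatrix
    product.toCoordinateMatrix().entries.saveAsTextFile("hdfs:///data/AxB")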

Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD in a mapPartition. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large. My hope is to do a mapPartitions within the

Re: Sending large objects to specific RDDs

2016-01-13 Thread Ted Yu
Another approach is to store the objects in a NoSQL store such as HBase. Looking up objects should be very fast. Cheers On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman wrote: > I'm looking for a way to send structures to pre-determined partitions so > that > they

automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Alexander Pivovarov
Is it possible to automatically unpersist RDDs which are not used for 24 hours?

Re: Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
Thank you Ted! That sounds like it would probably be the most efficient (with the least overhead) way of handling this situation. On Wed, Jan 13, 2016 at 11:36 AM Ted Yu wrote: > Another approach is to store the objects in NoSQL store such as HBase. > > Looking up object

Re: How to make Dataset api as fast as DataFrame

2016-01-13 Thread Michael Armbrust
The focus of this release was to get the API out there, and there are a lot of low-hanging performance optimizations. That said, there is likely always going to be some cost of materializing objects. Another note: any time you're comparing performance it's useful to include the output of explain so we

Re: Read Accumulator value while running

2016-01-13 Thread Daniel Imberman
Hi Kira, I'm having some trouble understanding your question. Could you please give a code example? From what I think you're asking there are two issues with what you're looking to do. (Please keep in mind I could be totally wrong on both of these assumptions, but this is what I've been lead

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Steve Loughran
use s3a://, especially on Hadoop 2.7+. It uses the Amazon libs and is faster for directory lookups than jets3t > On 13 Jan 2016, at 11:42, Darin McBeath wrote: > > I'm looking for some suggestions based on other's experiences. > > I currently have a job that I
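
For what it's worth, a minimal sketch of switching to s3a (the bucket and credentials are placeholders; on EC2 with instance roles the keys can usually be omitted):

    // Hadoop 2.7+: s3a uses the AWS SDK instead of jets3t.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")
    val lines = sc.textFile("s3a://my-bucket/some/prefix/*")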

Re: Long running jobs in CDH

2016-01-13 Thread Jorge Machado
Hi Jan, Oozie, or you can check the --supervise option: http://spark.apache.org/docs/latest/submitting-applications.html > On 11/01/2016, at 14:23, Jan Holmberg wrote: > > Hi, > any

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
I guess my big question would be why do you have so many files? Is there no possibility that you can merge a lot of those files together before processing them? On Wed, Jan 13, 2016 at 11:59 AM Darin McBeath wrote: > Thanks for the tip, as I had not seen this before.

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
Hi Darin, You should read this article. TextFile is very inefficient in S3. http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 Cheers On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath wrote: > I'm looking for some suggestions based on

Re: FPGrowth does not handle large result sets

2016-01-13 Thread Ritu Raj Tiwari
Thanks Sean! I'll start with higher support threshold and work my way down. On Wednesday, January 13, 2016 8:57 AM, Sean Owen wrote: You're looking for subsets of items that appear in at least 200 of 200,000 transactions, which could be a whole lot. Keep in mind

Re: How to make Dataset api as fast as DataFrame

2016-01-13 Thread Arkadiusz Bicz
Hi, Including query plan : DataFrame : == Physical Plan == SortBasedAggregate(key=[agreement#23], functions=[(MaxVectorAggFunction(values#3),mode=Final,isDistinct=false)], output=[agreement#23,maxvalues#27]) +- ConvertToSafe +- Sort [agreement#23 ASC], false, 0 +- TungstenExchange

Re: Kafka Streaming and partitioning

2016-01-13 Thread David D
Yep that's exactly what we want. Thanks for all the info Cody. Dave. On 13 Jan 2016 18:29, "Cody Koeninger" wrote: > The idea here is that the custom partitioner shouldn't actually get used > for repartitioning the kafka stream (because that would involve a shuffle, > which

Re: Serializing DataSets

2016-01-13 Thread Michael Armbrust
Yeah, that's the best way for now (note the conversion is purely logical, so there is no cost of calling toDF()). We'll likely be combining the classes in Spark 2.0 to remove this awkwardness. On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner wrote: > What's the proper way to
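
A small sketch of the round trip being described (Spark 1.6; the case class is made up):

    import sqlContext.implicits._
    case class Point(x: Double, y: Double)

    val ds  = Seq(Point(1.0, 2.0)).toDS() // Dataset[Point]
    val df  = ds.toDF()                   // DataFrame; a purely logical conversion
    val ds2 = df.as[Point]                // back to a typed Dataset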

Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Darin McBeath
I'm looking for some suggestions based on others' experiences. I currently have a job that I need to run periodically where I need to read on the order of 1+ million files from an S3 bucket. It is not the entire bucket (nor does it match a pattern). Instead, I have a list of random keys that

SQL UDF problem (with re to types)

2016-01-13 Thread raghukiran
While registering and using SQL UDFs, I am running into the following problem: UDF registered: ctx.udf().register("Test", new UDF1() { /** * */ private static final long

Usage of SparkContext within a Web container

2016-01-13 Thread praveen S
Is using a SparkContext from a web container the right way to process Spark jobs, or should we use spark-submit in a ProcessBuilder? Are there any pros or cons of using SparkContext from a web container? How does Zeppelin trigger Spark jobs from the web context?

[discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-13 Thread Reynold Xin
We've dropped Hadoop 1.x support in Spark 2.0. There is also a proposal to drop Hadoop 2.2 and 2.3, i.e. the minimal Hadoop version we support would be Hadoop 2.4. The main advantage is that we'd then be able to focus our Jenkins resources (and the associated maintenance of Jenkins) to create builds

Re: spark job failure - akka error Association with remote system has failed

2016-01-13 Thread vivek.meghanathan
Mohammed, As I mentioned in my latest email, it was failing due to a communication issue with Cassandra. Once I fixed that, the issue is no longer there. Regards, Vivek M From: Mohammed Guller Sent: Thursday, January 14, 2016 4:38 AM

RE: RE: spark streaming context trigger invoke stop why?

2016-01-13 Thread Triones,Deng(vip.com)
What's more, I am running a 7*24 job, so I won't call System.exit() myself. So I believe somewhere the driver kills itself. From: 邓刚 [Technical Center] Sent: 14 January 2016, 15:45 To: 'Yogesh Mahajan' Cc: user Subject: RE: RE: spark streaming context trigger invoke stop why? Thanks for your response,

RE: RE: spark streaming context trigger invoke stop why?

2016-01-13 Thread Triones,Deng(vip.com)
Thanks for your response. ApplicationMaster is only for YARN mode; I am using standalone mode. Could you kindly let me know what triggers the shutdown hook? From: Yogesh Mahajan [mailto:ymaha...@snappydata.io] Sent: 14 January 2016, 12:42 To: 邓刚 [Technical Center] Cc: user Subject: Re: RE: spark streaming

Re: SparkContext SyntaxError: invalid syntax

2016-01-13 Thread Bryan Cutler
Hi Andrew, There are a couple of things to check. First, is Python 2.7 the default version on all nodes in the cluster or is it an alternate install? Meaning what is the output of this command "$> python --version" If it is an alternate install, you could set the environment variable

Re: SQL UDF problem (with re to types)

2016-01-13 Thread Ted Yu
Please take a look at sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleSum.java which shows a UserDefinedAggregateFunction that works on DoubleType column. sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java shows how it is registered. Cheers On Wed, Jan

RE: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-13 Thread Mohammed Guller
Hi Angela, Yes, you can use Spark SQL JDBC/ThriftServer without Hive. Mohammed -Original Message- From: angela.whelan [mailto:angela.whe...@synchronoss.com] Sent: Wednesday, January 13, 2016 3:37 AM To: user@spark.apache.org Subject: Is it possible to use SparkSQL JDBC ThriftServer

RE: Spark ignores SPARK_WORKER_MEMORY?

2016-01-13 Thread Mohammed Guller
Barak, The SPARK_WORKER_MEMORY setting is used for allocating memory to executors. You can use SPARK_DAEMON_MEMORY to set memory for the worker JVM. Mohammed From: Barak Yaish [mailto:barak.ya...@gmail.com] Sent: Wednesday, January 13, 2016 12:59 AM To: user@spark.apache.org Subject: Spark
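
For illustration, the distinction in spark-env.sh might look like this (the values are placeholders):

    # spark-env.sh on each worker (standalone mode)
    SPARK_WORKER_MEMORY=16g   # total memory the worker may hand out to executors
    SPARK_DAEMON_MEMORY=2g    # heap of the worker daemon JVM itself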

Random Forest FeatureImportance throwing NullPointerException

2016-01-13 Thread Rachana Srivastava
I have a Random forest model for which I am trying to get the featureImportance vector. Map categoricalFeaturesParam = new HashMap<>(); scala.collection.immutable.Map categoricalFeatures = (scala.collection.immutable.Map)

Re: trouble calculating TF-IDF data type mismatch: '(tf * idf)' requires numeric type, not vector;

2016-01-13 Thread Andy Davidson
you'll need the following function if you want to run the test code Kind regards Andy private DataFrame createData(JavaRDD rdd) { StructField id = null; id = new StructField("id", DataTypes.IntegerType, false, Metadata.empty()); StructField label = null;

trouble calculating TF-IDF data type mismatch: '(tf * idf)' requires numeric type, not vector;

2016-01-13 Thread Andy Davidson
Below is a little snippet of my Java test code. Any idea how I implement member-wise vector multiplication? Also notice the idf value for 'Chinese' is 0.0? The calculation is ln((4+1) / (6/4 + 1)) = ln(2) = 0.6931 ?? Also, any idea if this code would work in a pipeline? I.e., is the pipeline
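
Not part of the thread, just a sketch of one way to do the member-wise (Hadamard) product of two mllib vectors, since there is no built-in operator for it (ml's ElementwiseProduct multiplies by a fixed scaling vector instead):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Element-wise product of two equal-length vectors; returns a dense result.
    def memberWiseProduct(a: Vector, b: Vector): Vector = {
      require(a.size == b.size, "vectors must have the same length")
      Vectors.dense(a.toArray.zip(b.toArray).map { case (x, y) => x * y })
    }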

Re: SparkContext SyntaxError: invalid syntax

2016-01-13 Thread Andrew Weiner
Thanks for your continuing help. Here is some additional info. *OS/architecture* output of *cat /proc/version*: Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com) output of *lsb_release -a*: LSB Version:

Re: Exception in Spark-sql insertIntoJDBC command

2016-01-13 Thread RichG
Hi All, I am having the same issue in Spark 1.5.1. I've tried through Scala and Python, and neither the 'append' nor the 'overwrite' mode allows me to insert rows from a dataframe into an existing MS SQL Server 2012 table. When the table doesn't exist, this works fine and creates the table with the

RE: spark job failure - akka error Association with remote system has failed

2016-01-13 Thread Mohammed Guller
Check the entries in your /etc/hosts file. Also check what the hostname command returns. Mohammed From: vivek.meghanat...@wipro.com [mailto:vivek.meghanat...@wipro.com] Sent: Tuesday, January 12, 2016 11:36 PM To: user@spark.apache.org Subject: RE: spark job failure - akka error Association

Re: RE: spark streaming context trigger invoke stop why?

2016-01-13 Thread Yogesh Mahajan
Hi Triones, Check org.apache.spark.util.ShutdownHookManager: it adds this shutdown hook when you start a StreamingContext. Here is the code in StreamingContext.start(): shutdownHookRef = ShutdownHookManager.addShutdownHook( StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)

Re: Manipulate Twitter Stream Filter on runtime

2016-01-13 Thread Yogesh Mahajan
Hi Alem, I haven't tried it, but can you give it a try: call TwitterStream.cleanUp() and add your modified filter to see if it works? I am using twitter4j 4.0.4 with Spark Streaming. Regards, Yogesh Mahajan SnappyData Inc, snappydata.io On Mon, Jan 11, 2016 at 6:43 PM, Filli Alem wrote:

Hive is unable to read avro file written by spark avro

2016-01-13 Thread Siva
Hi Everyone, Avro data written by a DataFrame to HDFS is not readable by Hive. I am saving the data in Avro format with the statement below: df.save("com.databricks.spark.avro", SaveMode.Append, Map("path" -> path)) I created a Hive Avro external table, and while reading I see all nulls. Did anyone face similar

RE: spark streaming context trigger invoke stop why?

2016-01-13 Thread Triones,Deng(vip.com)
More info: I am using Spark version 1.5.2. From: Triones,Deng(vip.com) [mailto:triones.d...@vipshop.com] Sent: 14 January 2016, 11:24 To: user Subject: spark streaming context trigger invoke stop why? Hi all, As I saw in the driver log, the task failed 4 times in a stage, so the stage will be dropped

spark streaming context trigger invoke stop why?

2016-01-13 Thread Triones,Deng(vip.com)
Hi all, As I saw in the driver log, the task failed 4 times in a stage, so the stage will be dropped when the input block was deleted before it could be used. After that, the StreamingContext invokes stop. Does anyone know what kind of akka message triggers the stop, or which code triggers the

[Spark Streaming] "Could not compute split, block input-0-1452563923800 not found” when trying to recover from checkpoint data

2016-01-13 Thread Collin Shi
Hi, I was doing a simple updateByKey transformation and print on the data received from a socket; the Spark version is 1.4.0. The first submit went all right, but then I killed (CTRL + C) the job and submitted again. Apparently Spark was trying to recover from the checkpoint data, but then the

Re: Hive is unable to read avro file written by spark avro

2016-01-13 Thread Kevin Mellott
Hi Sivakumar, I have run into this issue in the past, and we were able to fix it by using an explicit schema when saving the DataFrame to the Avro file. This schema was an exact match to the one associated with the metadata on the Hive database table, which allowed the Hive queries to work even

Re: SQL UDF problem (with re to types)

2016-01-13 Thread Raghu Ganti
So, when I try BigDecimal, it works. But, should it not parse based on what the UDF defines? Am I missing something here? On Wed, Jan 13, 2016 at 4:57 PM, Ted Yu wrote: > Please take a look > at sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleSum.java >

Re: RE: spark streaming context trigger invoke stop why?

2016-01-13 Thread Yogesh Mahajan
All the action happens in ApplicationMaster, especially in the run method. Check ApplicationMaster#startUserApplication: userThread (the driver) invokes the ApplicationMaster#finish method. You can also try System.exit in your program. Regards, Yogesh Mahajan, SnappyData Inc, snappydata.io On Thu, Jan

Spark on YARN job continuously reports "Application does not exist in cache"

2016-01-13 Thread Prabhu Joseph
Hi All, When we submit Spark jobs on YARN, during RM failover we see a lot of jobs reporting the error messages below. *2016-01-11 09:41:06,682 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1450676950893_0280_01* 2016-01-11

How to get the working directory in executor

2016-01-13 Thread Byron Wang
I am using the following command to submit a Spark job; I hope to send the jar and config files to each executor and load them there: spark-submit --verbose \ --files=/tmp/metrics.properties \ --jars /tmp/datainsights-metrics-source-assembly-1.0.jar \ --total-executor-cores 4 \ --conf

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
If two rdds have an identical partitioner, joining should not involve a shuffle. You should be able to override the partitioner without calling partitionBy. Two ways I can think of to do this: - subclass or modify the direct stream and kafkardd. They're private, so you'd need to rebuild just
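
A rough, untested sketch of the first option (the class name is hypothetical): a wrapper that declares a partitioner on a pair RDD whose data is already laid out that way, without shuffling. This is only safe if the upstream (Kafka) partitioning really follows the same key-to-partition mapping.

    import scala.reflect.ClassTag
    import org.apache.spark.{OneToOneDependency, Partition, Partitioner, TaskContext}
    import org.apache.spark.rdd.RDD

    class AlreadyPartitionedRDD[K: ClassTag, V: ClassTag](
        prev: RDD[(K, V)],
        part: Partitioner)
      extends RDD[(K, V)](prev.context, Seq(new OneToOneDependency(prev))) {

      require(part.numPartitions == prev.partitions.length,
        "partitioner must agree with the existing number of partitions")

      // Declare the partitioning; no data is moved.
      override val partitioner: Option[Partitioner] = Some(part)

      override protected def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
        prev.iterator(split, context)
    }

With both sides of a join reporting the same partitioner, the join can then be planned as a narrow dependency instead of a shuffle.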

Spark 1.6 and Application History not working correctly

2016-01-13 Thread Darin McBeath
I tried using Spark 1.6 in a stand-alone cluster this morning. I submitted 2 jobs (and they both executed fine). In fact, they are the exact same jobs with just some different parameters. I was able to view the application history for the first job. However, when I tried to view the second

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yep, the number of rows of the Matrix theta is the number of classes and the number of columns is the number of features. 2016-01-13 10:47 GMT+08:00 Andy Davidson : > I am trying to debug my trained model by exploring theta > Theta is a Matrix. The java Doc for Matrix says that it is

How to make Dataset api as fast as DataFrame

2016-01-13 Thread Arkadiusz Bicz
Hi, I have done some performance tests by repeating execution with different numbers of executors and memory for YARN-clustered Spark (version 1.6.0) (the cluster contains 6 large nodes). I found Dataset joinWith or cogroup to be 3 to 5 times slower than a broadcast join in DataFrame; how to
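
For context, the DataFrame-side broadcast join being compared against typically looks something like this (table and column names are placeholders):

    import org.apache.spark.sql.functions.broadcast
    val joined = bigDf.join(broadcast(smallDf), "agreement")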

Need 'Learning Spark' Partner

2016-01-13 Thread King sami
Hi, As I'm a beginner in Spark, I'm looking for someone who's also a beginner, to learn and train on Spark together. Please contact me if interested. Cordially,

Re: How to get the working directory in executor

2016-01-13 Thread Ted Yu
Can you place metrics.properties and datainsights-metrics-source-assembly-1.0.jar on HDFS? Cheers On Wed, Jan 13, 2016 at 8:01 AM, Byron Wang wrote: > I am using the following command to submit Spark job, I hope to send jar > and > config files to each executor and load it

Re: Spark 1.6 and Application History not working correctly

2016-01-13 Thread Steve Loughran
> On 13 Jan 2016, at 07:15, Darin McBeath wrote: > > Thanks. > > I already set the following in spark-defaults.conf so I don't think that is > going to fix my problem. > > spark.eventLog.dir file:///root/spark/applicationHistory > spark.eventLog.enabled true

distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Deenar Toraskar
Hi, I have data in HDFS partitioned by a logical key and would like to preserve the partitioning when creating a DataFrame for it. Is it possible to create a DataFrame that preserves partitioning from HDFS or the underlying RDD? Regards Deenar

Re: Spark 1.6 and Application History not working correctly

2016-01-13 Thread Darin McBeath
Thanks. I already set the following in spark-defaults.conf so I don't think that is going to fix my problem. spark.eventLog.dir file:///root/spark/applicationHistory spark.eventLog.enabled true I suspect my problem must be something else. Darin. From: Don

Spark Thrift Server 2 problem

2016-01-13 Thread Бобров Виктор
Hi, I'm trying to connect Tableau to Spark SQL, using this guide: https://community.tableau.com/docs/DOC-7638. Then I try to start the thrift server ($SPARK_HOME/sbin/start-thriftserver.sh --master spark://data01:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host

Re: Spark 1.6 and Application History not working correctly

2016-01-13 Thread Don Drake
I noticed a similar problem going from 1.5.x to 1.6.0 on YARN. I resolved it by setting the following command-line parameters: spark.eventLog.enabled=true spark.eventLog.dir= -Don On Wed, Jan 13, 2016 at 8:29 AM, Darin McBeath wrote: > I tried using Spark 1.6 in

Re: Kafka Streaming and partitioning

2016-01-13 Thread Dave
Thanks Cody, appreciate the response. With this pattern the partitioners will now match when the join is executed. However, does the wrapper RDD not need to set the partition metadata on the wrapped RDD in order to allow Spark to know where the data for each partition resides in the cluster?

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
In the case of a KafkaRDD here, the data doesn't reside on the cluster; it's not cached by default. If you're running Kafka on the same nodes as Spark, then data locality would play a factor, but that should be handled by the existing getPreferredLocations method. On Wed, Jan 13, 2016 at 10:46

Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball
If you load data using ORC or Parquet, the RDD will have a partition per file, so in fact your data frame will not directly match the partitioning of the table. If you want to process by partition and guarantee that partitioning is preserved, then mapPartitions etc. will be useful. Note that if you perform any
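
For example (the pair RDD and function are hypothetical), a per-partition transformation that keeps the existing partitioner:

    // Keys are untouched, so it is safe to tell Spark the partitioning still holds.
    val result = pairRdd.mapPartitions(
      iter => iter.map { case (k, v) => (k, transform(v)) },
      preservesPartitioning = true)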

Re: FPGrowth does not handle large result sets

2016-01-13 Thread Ritu Raj Tiwari
Hi Sean: Thanks for checking out my question here. It's possible I am making a newbie error. Based on my dataset of about 200,000 transactions and a minimum support level of 0.001, I am looking for items that appear at least 200 times. Given that the items in my transactions are drawn from a set

Re: FPGrowth does not handle large result sets

2016-01-13 Thread Sean Owen
You're looking for subsets of items that appear in at least 200 of 200,000 transactions, which could be a whole lot. Keep in mind there are 25,000 items, sure, but already 625,000,000 possible pairs of items, and trillions of possible 3-item subsets. This sounds like it's just far too low. Start

Running window functions in spark dataframe

2016-01-13 Thread rakesh sharma
Hi all, I am getting a HiveContext error when trying to run window functions like over with an ordering clause. Any help on how to go about this? I am running Spark locally. Sent from Outlook Mobile -- Forwarded message -- From: "King sami"

Re: How to get the working directory in executor

2016-01-13 Thread Ted Yu
In a bit more detail: you upload the files using the 'hdfs dfs -copyFromLocal' command, then specify the HDFS location of the files on the command line. Cheers On Wed, Jan 13, 2016 at 8:05 AM, Ted Yu wrote: > Can you place metrics.properties and >

Re: PCA OutOfMemoryError

2016-01-13 Thread Alex Gittens
The PCA.fit function calls the RowMatrix PCA routine, which attempts to construct the covariance matrix locally on the driver, and then computes the SVD of that to get the PCs. I'm not sure what's causing the memory error: RowMatrix.scala:124 is only using 3.5 GB of memory (n*(n+1)/2 with n=29604
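
For reference, the arithmetic behind that figure: the packed upper triangle of the Gramian/covariance matrix holds n(n+1)/2 = 29604 x 29605 / 2 ≈ 4.4e8 doubles, i.e. about 3.5 GB at 8 bytes each; if a full dense n x n copy is also materialized on the driver, that would be roughly 29604^2 x 8 bytes ≈ 7 GB on top.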

Re: Spark SQL UDF with Struct input parameters

2016-01-13 Thread Deenar Toraskar
I have raised a JIRA to cover this https://issues.apache.org/jira/browse/SPARK-12809 On 13 January 2016 at 16:05, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Frank > > Sorry got my wires crossed, I had come across another issue. Now I > remember this issue I got around this

Re: Kafka Streaming and partitioning

2016-01-13 Thread Dave
So for case 1 below - subclass or modify the direct stream and KafkaRDD; they're private, so you'd need to rebuild just the external kafka project, not all of Spark. When the data is read from Kafka, it will be partitioned correctly with the custom partitioner passed in to the new direct stream

yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Lin Zhao
My job runs fine in yarn cluster mode but I have reason to use client mode instead. But I'm hitting this error when submitting: > spark-submit --class com.exabeam.martini.scripts.SparkStreamingTest --master > yarn --deploy-mode client --executor-memory 90G --num-executors 3 > --executor-cores

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Ted Yu
Can you show the complete stack trace for the error? I searched the 1.6.0 code base but didn't find the class SparkSubmitDriverBootstrapper. Thanks On Wed, Jan 13, 2016 at 9:31 AM, Lin Zhao wrote: > My job runs fine in yarn cluster mode but I have reason to use client mode >

Re: Is it possible to use SparkSQL JDBC ThriftServer without Hive

2016-01-13 Thread Antonio Piccolboni
The Thrift server creates a HiveContext, hence the Hive libs, but you don't need to have Hive running at all. Depending on what you mean by "without Hive", that could be a positive or negative answer to your question. You need the dependencies, but you don't need a running instance. On Wed, Jan 13, 2016

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Jeff Zhang
I also didn't find SparkSubmitDriverBootstrapper; which version of Spark are you using? On Wed, Jan 13, 2016 at 9:36 AM, Ted Yu wrote: > Can you show the complete stack trace for the error ? > > I searched 1.6.0 code base but didn't find the > class

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Andy Davidson
Thanks Andy From: Yanbo Liang Date: Wednesday, January 13, 2016 at 6:29 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: ml.classification.NaiveBayesModel how to reshape theta > Yep, row of Matrix theta is

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Marcelo Vanzin
SparkSubmitDriverBootstrapper was removed back in Spark 1.4, so it seems you have a mixed bag of 1.3 / 1.6 in your path / classpath and things are failing because of that. On Wed, Jan 13, 2016 at 9:31 AM, Lin Zhao wrote: > My job runs fine in yarn cluster mode but I have reason to