Passing binding variable in query used in Data Source API

2016-01-21 Thread satish chandra j
Hi All, We have a requirement to fetch data from a source PostgreSQL database based on a condition, hence we need to pass a binding variable into the query used in the Data Source API, as below: var DeptNbr = 10 val dataSource_dF=cc.load("jdbc",Map("url"->"jdbc:postgresql://
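
A minimal sketch of one way to do this, interpolating the variable into a subquery that is passed as the table; the connection URL, table, and column names are illustrative assumptions:

    // sketch only: host, database, table and column names are placeholders
    val deptNbr = 10
    val dataSourceDF = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://host:5432/mydb?user=usr&password=pwd",
      "driver"  -> "org.postgresql.Driver",
      // the binding variable is interpolated into a subquery used as the table
      "dbtable" -> s"(select * from dept where dept_number = $deptNbr) as dept_subset"
    )).load()

Pushing the condition into the subquery means PostgreSQL evaluates the filter, so only matching rows cross the JDBC connection.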

Re: is recommendProductsForUsers available in ALS?

2016-01-21 Thread Nick Pentreath
These methods are available in Spark 1.6 On Tue, Jan 19, 2016 at 12:18 AM, Roberto Pagliari < roberto.pagli...@asos.com> wrote: > With Spark 1.5, the following code: > > from pyspark import SparkContext, SparkConf > from pyspark.mllib.recommendation import ALS, Rating > r1 = (1, 1,
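
For reference, a minimal Scala sketch of the MLlib call in question (the original question is PySpark; the toy ratings are illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // tiny illustrative dataset of (user, product, rating) triples
    val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
    val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda

    // top-3 product recommendations for every user, as an RDD[(userId, Array[Rating])]
    val recs = model.recommendProductsForUsers(3)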

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Soni spark
Hi, I am now facing the error message below. Please help me. 2016-01-21 16:06:14,123 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xxx.xx.xx.xx:50010 for block, add to deadNodes and continue. java.nio.channels.ClosedByInterruptException java.nio.channels.ClosedByInterruptException at

Re: Concurrent Spark jobs

2016-01-21 Thread emlyn
Thanks for the responses (not sure why they aren't showing up on the list). Michael wrote: > The JDBC wrapper for Redshift should allow you to follow these > instructions. Let me know if you run into any more issues. >

question about query SparkSQL

2016-01-21 Thread Eli Super
Hi, I am trying to save parts of a large table as CSV files. I use the following commands: sqlContext.sql("select * from my_table where trans_time between '2015/12/18 12:00' and '2015/12/18 12:06'").write.format("com.databricks.spark.csv").option("header", "false").save('00_06') and sqlContext.sql("select

Client versus cluster mode

2016-01-21 Thread Afshartous, Nick
Hi, In an AWS EMR/Spark 1.5 cluster we're launching a streaming job from the driver node. Would it make any sense in this case to use cluster mode ? More specifically would there be any benefit that YARN would provide when using cluster but not client mode ? Thanks, -- Nick

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Akhil Das
Can you look in the executor logs and see why the sparkcontext is being shutdown? Similar discussion happened here previously. http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html Thanks Best Regards On Thu, Jan 21, 2016 at 5:11 PM, Soni spark

Spark Yarn executor memory overhead content

2016-01-21 Thread Olivier Devoisin
Hello, In some of our spark applications, when writing outputs to hdfs we encountered an error about the spark yarn executor memory overhead : WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 3.0 GB of 3 GB physical memory used. Consider boosting

Re: Client versus cluster mode

2016-01-21 Thread Manoj Awasthi
The only difference is that in yarn-cluster mode your driver runs within a YARN container (called the AM or application master). You would want to run your production jobs in yarn-cluster mode, while for a development environment yarn-client mode may do. Again, I think this is just a recommendation

How to setup a long running spark streaming job with continuous window refresh

2016-01-21 Thread Santoshakhilesh
Hi, I have the following scenario in my project: 1. I will continuously get a stream of data from a source. 2. I need to calculate the mean and variance for each key every minute. 3. After the minute is over, I should restart computing the values fresh for the new minute. Example: 10:00:00 computation and
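
One way to express this per-minute, non-overlapping computation is a tumbling window; a sketch assuming the stream has already been mapped to (key, value) pairs:

    import org.apache.spark.streaming.Seconds

    // sketch: `stream` is assumed to be a DStream[(String, Double)]
    // a 60s window sliding every 60s means each minute is computed fresh
    val stats = stream
      .window(Seconds(60), Seconds(60))
      .mapValues(v => (1L, v, v * v))   // (count, sum, sum of squares)
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
      .mapValues { case (c, s, q) =>
        val mean = s / c
        (mean, q / c - mean * mean)     // (mean, variance)
      }
    stats.print()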

Number of executors in Spark - Kafka

2016-01-21 Thread Guillermo Ortiz
I'm using Spark Streaming and Kafka with the Direct Approach. I have created a topic with 6 partitions, so when I execute Spark there are six RDDs. I understand that ideally there should be six executors, one to process each of them. To do this, when I execute spark-submit (I use YARN) I specify the

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Ted Yu
Please also check AppMaster log. Thanks > On Jan 21, 2016, at 3:51 AM, Akhil Das wrote: > > Can you look in the executor logs and see why the sparkcontext is being > shutdown? Similar discussion happened here previously. >

Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
As I understand it, Spark 1.6 changes the behaviour of the first and last aggregate functions to take nulls into account (where they were ignored in 1.5). From SQL you can use "IGNORE NULLS" to get the old behaviour back. How do I ignore nulls

RE: Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-21 Thread Siddharth Ubale
Hi Wellington, Thanks for the reply. I have kept the default values for the two settings mentioned below. The zip file is expected by the Spark job in the Spark staging folder in HDFS. None of the documentation mentions this file. Also, I have noticed one more

Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-21 Thread Patrick McGloin
Hi all, To have a simple way of testing the Spark Streaming Write Ahead Log I created a very simple Custom Input Receiver, which will generate strings and store those: class InMemoryStringReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) { val batchID =
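
A hedged sketch of how such a receiver can be completed (the generation logic and record contents below are illustrative, not the original code):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class InMemoryStringReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {
      def onStart(): Unit = {
        new Thread("InMemoryStringReceiver") {
          override def run(): Unit = {
            var i = 0
            while (!isStopped()) {
              store(s"record-$i")   // handed to Spark, and to the WAL when it is enabled
              i += 1
              Thread.sleep(100)
            }
          }
        }.start()
      }
      def onStop(): Unit = { }      // the generator thread exits once isStopped() is true
    }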

spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Soni spark
Hi Friends, My Spark job runs successfully in local mode but fails in cluster mode. Below is the error message I am getting; can anyone help me? 16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection. 16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver

Re: best practice : how to manage your Spark cluster ?

2016-01-21 Thread Arkadiusz Bicz
Hi Charles, We are using Ambari for Hadoop/Spark service management, versioning and monitoring in the cluster. For real-time monitoring of Spark jobs and of cluster hosts, disks, memory, CPU and network, we use graphite + grafana + collectd + spark metrics

Re: Parquet write optimization by row group size config

2016-01-21 Thread Pavel Plotnikov
I have about 25 separate gzipped log files per hour. The file sizes vary a lot, from 10MB to 50MB of gzipped JSON data. So I convert this data to Parquet every hour. The code is very simple, in Python: text_file = sc.textFile(src_file) df = sqlCtx.jsonRDD(text_file.map(lambda x:
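
If the goal is to control the Parquet row group size for these hourly conversions, one knob is the Hadoop configuration key parquet.block.size; a Scala sketch with placeholder paths (the Python code would set the same key on the Hadoop configuration):

    // sketch: ask the Parquet writer for roughly 128 MB row groups
    sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

    val df = sqlContext.read.json(sc.textFile("hdfs:///logs/2016-01-21-12/*.gz"))
    df.write.parquet("hdfs:///warehouse/logs/parquet/2016-01-21-12")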

Re: How to use scala.math.Ordering in java

2016-01-21 Thread Dave
Thanks Ted. On 20/01/16 18:24, Ted Yu wrote: Please take a look at the following files for some examples: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java Cheers On Wed,

cast column string -> timestamp in Parquet file

2016-01-21 Thread Eli Super
Hi, I have a large Parquet file. I need to cast a whole column to timestamp format and then save. What is the right way to do it? Thanks a lot

Re: Number of executors in Spark - Kafka

2016-01-21 Thread Cody Koeninger
6 kafka partitions will result in 6 spark partitions, not 6 spark rdds. The question of whether you will have a backlog isn't just a matter of having 1 executor per partition. If a single executor can process all of the partitions fast enough to complete a batch in under the required time, you

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread Sebastian Piu
I'm using Spark 1.6.0. I tried removing Kryo and reverting back to Java Serialisation, and get a different error which maybe points in the right direction... java.lang.AssertionError: assertion failed: No plan for BroadcastHint +- InMemoryRelation

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm using Spark 1.5.0, confirmed, except for this jar: file:/opt/centralLogs/lib/spark-catalyst_2.10-1.5.1.jar. I'm going to keep looking. Thank you! 2016-01-21 16:29 GMT+01:00 Ted Yu : > Maybe this is related (fixed in 1.5.3): > SPARK-11195 Exception thrown on executor

Re: [Spark Streaming][Problem with DataFrame UDFs]

2016-01-21 Thread Cody Koeninger
If you can share an isolated example I'll take a look. Not something I've run into before. On Wed, Jan 20, 2016 at 3:53 PM, jpocalan wrote: > Hi, > > I have an application which creates a Kafka Direct Stream from 1 topic > having 5 partitions. > As a result each batch is

Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Todd Nist
Hi Satish, You should be able to do something like this: val props = new java.util.Properties() props.put("user", username) props.put("password",pwd) props.put("driver", "org.postgresql.Driver") val deptNo = 10 val where = Some(s"dept_number = $deptNo") val df =
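
A hedged completion of that approach, using DataFrameReader.jdbc with a predicate so the condition is evaluated on the PostgreSQL side (connection URL, credentials and table name are placeholders):

    val props = new java.util.Properties()
    props.put("user", "dbuser")        // placeholder credentials
    props.put("password", "dbpass")
    props.put("driver", "org.postgresql.Driver")

    val deptNo = 10
    val where = s"dept_number = $deptNo"

    // each predicate in the array becomes one partition of the resulting DataFrame
    val df = sqlContext.read.jdbc(
      "jdbc:postgresql://host:5432/mydb", "dept", Array(where), props)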

Re: Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
Turns out I can't use a user defined aggregate function, as they are not supported in Window operations. There surely must be some way to do a last_value with ignoreNulls enabled in Spark 1.6? Any ideas for workarounds? -- View this message in context:

Re: Spark job stops after a while.

2016-01-21 Thread Ted Yu
Looks like jar containing EsHadoopIllegalArgumentException class wasn't in the classpath. Can you double check ? Which Spark version are you using ? Cheers On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz wrote: > I'm runing a Spark Streaming process and it stops in a

RE: Spark SQL . How to enlarge output rows ?

2016-01-21 Thread Spencer, Alex (Santander)
I forgot to add that this is (I think) from 1.5.0. And yeah, that looks like Python – I'm not hot with Python, but it may need to be capitalised as False or FALSE? From: Eli Super [mailto:eli.su...@gmail.com] Sent: 21 January 2016 14:48 To: Spencer, Alex (Santander) Cc: user@spark.apache.org Subject: Re:

Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm running a Spark Streaming process and it stops after a while. It does some processing and inserts the result into Elasticsearch with its library. After a while the process fails. I have been checking the logs and I have seen this error: 2016-01-21 14:57:54,388

java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread sebastian.piu
Hi all, I'm trying to work out a problem when using Spark Streaming. Currently I have the following piece of code inside a foreachRDD call: DataFrame results = ... //some dataframe created from the incoming rdd - moderately big, I don't want this to be shuffled DataFrame t =

Re: Spark job stops after a while.

2016-01-21 Thread Ted Yu
Maybe this is related (fixed in 1.5.3): SPARK-11195 Exception thrown on executor throws ClassNotFoundException on driver FYI On Thu, Jan 21, 2016 at 7:10 AM, Guillermo Ortiz wrote: > I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2). > > I know that the

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Ted Yu
The exception below is at WARN level. Can you check HDFS health? Which Hadoop version are you using? There should be another, fatal error if your job failed. Cheers On Thu, Jan 21, 2016 at 4:50 AM, Soni spark wrote: > Hi, > > I am facing below error msg now. please

Re: Spark SQL . How to enlarge output rows ?

2016-01-21 Thread Eli Super
Thanks Alex, I get a NameError: name 'false' is not defined. Is it because of PySpark? On Thu, Jan 14, 2016 at 3:34 PM, Spencer, Alex (Santander) < alex.spen...@santander.co.uk> wrote: > Hi, > > > > Try …..show(*false*) > > > > public void show(int numRows, > > boolean

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2). I know that the library is here: cloud-user@ose10kafkaelk:/opt/centralLogs/lib$ jar tf elasticsearch-hadoop-2.2.0-beta1.jar | grep EsHadoopIllegalArgumentException org/elasticsearch/hadoop/EsHadoopIllegalArgumentException.class I

10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
Hey all, I am a CS student in the United States working on my senior thesis. My thesis uses Spark, and I am encountering some trouble. I am using https://github.com/alitouka/spark_dbscan, and to determine parameters, I am using the utility class they supply,

No plan for BroadcastHint when attempting broadcastjoin

2016-01-21 Thread Ted Yu
Modified subject to reflect new error encountered. Interesting - SPARK-12275 is marked fixed against 1.6.0 On Thu, Jan 21, 2016 at 7:30 AM, Sebastian Piu wrote: > I'm using Spark 1.6.0. > > I tried removing Kryo and reverting back to Java Serialisation, and get a >

Re: cast column string -> timestamp in Parquet file

2016-01-21 Thread Muthu Jayakumar
Use a DataFrame and a UDF. This may be more performant than an RDD transformation, as you'll transform only the column that needs to be changed. Hope this helps. On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote: > Hi > > I have a large size parquet file . > > I need
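
A sketch of that idea using the built-in unix_timestamp function rather than a hand-written UDF (column name, input format, and paths are illustrative assumptions):

    import org.apache.spark.sql.functions._

    val df = sqlContext.read.parquet("/path/to/input.parquet")
    val fixed = df.withColumn("event_time",
      unix_timestamp(col("event_time"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
    fixed.write.parquet("/path/to/output.parquet")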

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread Ted Yu
You were using Kryo serialization ? If you switch to Java serialization, your job should run fine. Which Spark release are you using ? Thanks On Thu, Jan 21, 2016 at 6:59 AM, sebastian.piu wrote: > Hi all, > > I'm trying to work out a problem when using Spark

Re: [Spark Streaming][Problem with DataFrame UDFs]

2016-01-21 Thread Jean-Pierre OCALAN
Quick correction in the code snippet I sent in my previous email: Line: val enrichedDF = inputDF.withColumn("semantic", udf(col("url"))) Should be replaced by: val enrichedDF = inputDF.withColumn("semantic", enrichUDF(col("url"))) On Thu, Jan 21, 2016 at 11:07 AM, Jean-Pierre OCALAN
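
For context, a minimal sketch of what the corrected line assumes: a UDF defined once and then applied through withColumn. The enrichment logic shown is a placeholder, not the original, and inputDF is the DataFrame from the quoted snippet:

    import org.apache.spark.sql.functions.{col, udf}

    // placeholder enrichment: the real UDF presumably derives a "semantic" value from the URL
    val enrichUDF = udf((url: String) => if (url == null) "" else url.toLowerCase)

    val enrichedDF = inputDF.withColumn("semantic", enrichUDF(col("url")))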

Recovery for Spark Streaming Kafka Direct with OffsetOutOfRangeException

2016-01-21 Thread Dan Dutrow
Hey Cody, I would have responded to the mailing list but it looks like this thread got aged off. I have the problem where one of my topics dumps more data than my Spark job can keep up with. We limit the input rate with maxRatePerPartition. Eventually, when the data is aged off, I get the

Re: Spark Cassandra Java Connector: records missing despite consistency=ALL

2016-01-21 Thread Dennis Birkholz
Hi Anthony, no, the logging is not done via Spark (but PHP). But that does not really matter, as the records are eventually there. So it is the READ_CONSISTENCY=ALL that is not working. Btw, it seems that using withReadConf() and setting the consistency level there works, but I need to

[ANNOUNCE] Apache Nutch 2.3.1 Release

2016-01-21 Thread lewis john mcgibbney
Hi Folks, !!Apologies for cross posting!! The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v2.3.1, we advise all current users and developers of the 2.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 2.X branch

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
Can you provide a bit more information? The command line for submitting the Spark job, the version of Spark, and anything interesting from the driver / executor logs? Thanks On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B wrote: > Hey all, > > I am a CS student in the United States

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
The Spark version is 1.4.1. The logs are full of standard fare, nothing like an exception or even interesting [INFO] lines. Here is the script I am using: https://gist.github.com/isaacsanders/660f480810fbc07d4df2 Thanks Isaac On Jan 21, 2016, at 11:03 AM, Ted Yu

Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Kevin Mellott
Another alternative that you can consider is to use Sqoop to move your data from PostgreSQL to HDFS, and then just load it into your DataFrame without needing to use JDBC drivers. I've had success using this approach, and depending on your setup you can easily

Re: retrieve cell value from a rowMatrix.

2016-01-21 Thread Srivathsan Srinivas
Hi Zhang, I am new to Scala and Spark. I am not a Java guy (more of a Python and R guy). I just began playing with matrices in MLlib and it looks painful to do simple things. If you can show me a small example, it would help. The apply function is not available in RowMatrix. For example: import

Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-21 Thread Ted Yu
I searched for checkpoint related methods in various Listener classes but haven't found any. Analyzing DAG is tedious and fragile since DAG may change in future Spark releases. Cheers On Thu, Jan 21, 2016 at 8:25 AM, Brian London wrote: > Thanks. It looks like

Re: Recovery for Spark Streaming Kafka Direct with OffsetOutOfRangeException

2016-01-21 Thread Cody Koeninger
Looks like this response did go to the list. As far as OffsetOutOfRange goes, right now that's an unrecoverable error, because it breaks the underlying invariants (e.g. that the number of messages in a partition is deterministic once the RDD is defined) If you want to do some hacking for your

Re: Re: --driver-java-options not support multiple JVM configuration ?

2016-01-21 Thread Marcelo Vanzin
That's because you should have used the other, correct version, separated by spaces instead of commas. On Wed, Jan 20, 2016 at 9:55 PM, our...@cnsuning.com wrote: > Marcelo, > error also exists with quotes around "$sparkdriverextraJavaOptions": > > Unrecognized VM option >
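
For reference, a sketch of the space-separated form being referred to (the JVM options, class and jar names are illustrative):

    spark-submit \
      --driver-java-options "-XX:+UseG1GC -Dlog4j.configuration=file:log4j.properties" \
      --class com.example.MyApp \
      my-app.jar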

Re: Spark Yarn executor memory overhead content

2016-01-21 Thread Marcelo Vanzin
On Thu, Jan 21, 2016 at 5:42 AM, Olivier Devoisin wrote: > The documentation states that it contains VM overheads, interned strings and > other native overheads. However it's really vague. It's intentionally vague, because it's "everything that is not Java

Re: SparkContext SyntaxError: invalid syntax

2016-01-21 Thread Andrew Weiner
Thanks Felix. I think I was missing gem install pygments.rb, and I also had to roll back to Python 2.7, but I got it working. I submitted the PR with the added explanation in the docs. Andrew On Wed, Jan 20, 2016 at 1:44 AM, Felix Cheung wrote: > > I have

Re: No plan for BroadcastHint when attempting broadcastjoin

2016-01-21 Thread Sebastian Piu
I made some modifications to my code where I broadcast the DataFrame I want to join directly through the SparkContext, and that seems to work as expected. I still don't understand what is going wrong with the missing plan. On Thu, Jan 21, 2016 at 3:36 PM, Ted Yu wrote: >

Date / time stuff with spark.

2016-01-21 Thread Andrew Holway
Hello, I am importing this data from HDFS into a data frame with sqlContext.read.json(). {"a": 42, "a": 56, "Id": "621368e2f829f230", "smunkId": "CKm26sDMucoCFReRGwodbHAAgw", "popsicleRange": "17610", "time": "2016-01-20T23:59:53+00:00"} I want to do some date/time operations on this JSON data
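
A Scala sketch of one approach, for illustration (the thread is working in Python): parse the ISO-8601 time column into a proper timestamp and derive date parts from it. The path is a placeholder; the column name follows the sample record above:

    import org.apache.spark.sql.functions._

    val df = sqlContext.read.json("hdfs:///path/to/data")
    val withTs = df.withColumn("ts",
      unix_timestamp(col("time"), "yyyy-MM-dd'T'HH:mm:ssXXX").cast("timestamp"))

    // e.g. count records per hour once the column is a real timestamp
    val byHour = withTs.groupBy(hour(col("ts"))).count()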

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I think it's that bug, because the error is the same. Thanks a lot. 2016-01-21 16:46 GMT+01:00 Guillermo Ortiz : > I'm using 1.5.0 of Spark confirmed. Less this > jar file:/opt/centralLogs/lib/spark-catalyst_2.10-1.5.1.jar. > > I'm going to keep looking for,, Thank

Re: [Spark Streaming][Problem with DataFrame UDFs]

2016-01-21 Thread Jean-Pierre OCALAN
Hi Cody, First of all thanks a lot for your quick reply, although I removed this post a couple of hours after posting it because I ended up finding it was due to the way I was using DataFrame UDFs. Essentially I didn't know that UDFs were purely lazy, and in the case of the example below the UDF

Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-21 Thread Brian London
Thanks. It looks like extending my batch duration to 7 seconds is a work-around. I'd like to build a check for the lack of checkpointing in our integration tests. Is there a way to parse the DAG at runtime? On Wed, Jan 20, 2016 at 2:01 PM Ted Yu wrote: > This is related:

MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-21 Thread Arun Luthra
WARN MemoryStore: Not enough space to cache broadcast_4 in memory! (computed 60.2 MB so far) WARN MemoryStore: Persisting block broadcast_4 to disk instead. Can I increase the memory allocation for broadcast variables? I have a few broadcast variables that I create with sc.broadcast() . Are
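
A hedged config sketch: broadcast blocks live in the block manager's storage memory, so on the Spark 1.5 memory manager giving executors more heap and/or a larger storage fraction may give broadcasts more room (whether it helps depends on what else fills storage memory; values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")            // more heap per executor
      .set("spark.storage.memoryFraction", "0.6")    // share of heap for cached/broadcast blocks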

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set. spark.yarn.driver.memoryOverhead is set

Re: Job History Logs for spark jobs submitted on YARN

2016-01-21 Thread nsalian
Hello, Thanks for the question. 1) Typically the Resource Manager in YARN will print out the Aggregate Resource Allocation for the application once you have found the specific application using the application id. 2) As with MapReduce, there is a parameter that is part of either the

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Josh Rosen
Is speculation enabled? This TaskCommitDenied by driver error is thrown by writers who lost the race to commit an output partition. I don't think this has anything to do with key skew etc. Replacing the groupByKey with a count will mask this exception because the coordination does not get

Re: Getting Co-oefficients of a logistic regression model for a pipelinemodel Spark ML library

2016-01-21 Thread Holden Karau
Hi Vinayaka, You can access the different stages in your pipeline through the stages array on your pipeline model ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.PipelineModel ) and then cast it to the correct stage type (if working in Scala; or, if in Python, just access
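
A sketch of that approach in Scala, assuming pipelineModel is the fitted PipelineModel and the logistic regression is its last stage:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.LogisticRegressionModel

    // pipelineModel is assumed to be the result of pipeline.fit(trainingData)
    val lrModel = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]
    println(lrModel.coefficients)   // called `weights` in releases before 1.6
    println(lrModel.intercept)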

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
My hunch is that the TaskCommitDenied is perhaps a red herring and the problem is groupByKey - but I've also just seen a lot of people be bitten by it, so that might not be the issue. If you just do a count at the point of the groupByKey, does the pipeline succeed? On Thu, Jan 21, 2016 at 2:56 PM, Arun

TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Example warning: 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 2168, attempt: 4436 Is there a solution for this? Increase driver memory? I'm using just 1G driver memory but ideally I

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Usually the pipeline works, it just failed on this particular input data. The other data it has run on is of similar size. Speculation is enabled. I'm using Spark 1.5.0. Here is the config. Many of these may not be needed anymore, they are from trying to get things working in Spark 1.2 and 1.3.

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
Can you post more of your log? How big are the partitions? What is the action you are performing? On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra wrote: > Example warning: > > 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID > 4436, XXX):

Getting Co-oefficients of a logistic regression model for a pipelinemodel Spark ML library

2016-01-21 Thread Vinayak Agrawal
Hi All, I am working with the Spark ML package, NOT mllib. I have a working PipelineModel. I am trying to get the coefficients of my model but I can't find a method to do so. The documentation here http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression shows how to get

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
Before we dig too far into this, the thing which most quickly jumps out to me is groupByKey, which could be causing some problems - what's the distribution of keys like? Try replacing the groupByKey with a count() and see if the pipeline works up until that stage. Also, 1G of driver memory is a
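
A small sketch of the suggested check, assuming `pairs` is the RDD feeding the groupByKey:

    // original, shuffle-heavy step
    // val grouped = pairs.groupByKey()

    // temporary replacement: if this count completes, the stages before the
    // groupByKey are healthy and the problem lies in the grouping/commit phase
    val sanity = pairs.count()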

Re: General Question (Spark Hive integration )

2016-01-21 Thread Silvio Fiorito
Also, just to clarify it doesn’t read the whole table into memory unless you specifically cache it. From: Silvio Fiorito > Date: Thursday, January 21, 2016 at 10:02 PM To: "Balaraju.Kagidala Kagidala"

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
Looks like you were running on YARN. What hadoop version are you using ? Can you capture a few stack traces of the AppMaster during the delay and pastebin them ? Thanks On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B wrote: > The Spark Version is 1.4.1 > > The

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Looking into the yarn logs for a similar job where an executor was associated with the same error, I find: ... 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive connection to (SERVER), creating a new one. 16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while

Re: General Question (Spark Hive integration )

2016-01-21 Thread Silvio Fiorito
Hi Bala, It depends on how your Hive table is configured. If you used partitioning and you are filtering on a partition column then it will only load the relevant partitions. If, however, you’re filtering on a non-partitioned column then it will have to read all the data and then filter as
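
A small sketch of the distinction (table and column names are illustrative), assuming `employee` is a Hive table partitioned by `dept`:

    // only the dept=10 partition directories are read (partition pruning)
    val pruned = sqlContext.sql("select * from employee where dept = 10")

    // emp_id is not a partition column, so all partitions are scanned and then filtered
    val scanned = sqlContext.sql("select * from employee where emp_id = 12345")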

Re: General Question (Spark Hive integration )

2016-01-21 Thread Bala
Thanks for the response Silvio. My table is not partitioned because my filter column is the primary key; I guess we can't partition on a primary key column. My table has 600 million rows; if I query a single record it seems that by default it loads the whole data and takes some time to just return a single

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Two changes I made that appear to be keeping various errors at bay: 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3ccacbyxkld8qasymj2ghk__vttzv4gejczcqfaw++s1d5te1d...@mail.gmail.com%3E . Even though I

General Question (Spark Hive integration )

2016-01-21 Thread Balaraju.Kagidala Kagidala
Hi, I have a simple question regarding Spark Hive integration with DataFrames. When we query a table, does Spark load the whole table into memory and apply the filter on top of it, or does it only load the data with the filter applied? For example, if my query is 'select * from employee where

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
You may have noticed the following - did this indicate prolonged computation in your code ? org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205) org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
That thread seems to be moving, it oscillates between a few different traces… Maybe it is working. It seems odd that it would take that long. This is 3rd party code, and after looking at some of it, I think it might not be as Spark-y as it could be. I linked it below. I don’t know a lot about

Spark partition size tuning

2016-01-21 Thread Jia Zou
Dear all! When using Spark to read from the local file system, the default partition size is 32MB. How can I increase the partition size to 128MB, to reduce the number of tasks? Thank you very much! Best Regards, Jia
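
One safe way to end up with fewer, larger partitions regardless of the input split size is to coalesce after reading; a sketch with illustrative numbers:

    val raw = sc.textFile("file:///data/input")

    // illustrative: ~4 GB of input aimed at ~128 MB per partition => ~32 partitions
    val inputSizeBytes = 4L * 1024 * 1024 * 1024
    val targetPartitions = math.max(1, (inputSizeBytes / (128L * 1024 * 1024)).toInt)
    val repartitioned = raw.coalesce(targetPartitions)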

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Darren Govoni
I've experienced this same problem. Always the last stage hangs. Indeterminate. No errors in logs. I run Spark 1.5.2. Can't find an explanation. But it's definitely a showstopper. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Ted Yu

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
Hadoop is: HDP 2.3.2.0-2950 Here is a gist (pastebin) of my versions en masse and a stacktrace: https://gist.github.com/isaacsanders/2e59131758469097651b Thanks On Jan 21, 2016, at 7:44 PM, Ted Yu > wrote: Looks like you were running on YARN.

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
You may have seen the following on the github page: Latest commit 50fdf0e on Feb 22, 2015. That was 11 months ago. Can you search for a similar algorithm which runs on Spark and is newer? If nothing is found, consider running the tests coming from the project to determine whether the delay is

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Sanders, Isaac B
I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am using more resources on this one. - Isaac On Jan 21, 2016, at 11:06 PM, Ted Yu > wrote:

avg(df$column) not returning a value but just the text "Column avg"

2016-01-21 Thread Devesh Raj Singh
Hi, I want to compute the average of the numerical columns in the iris dataset using SparkR. Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.3.0" "sparkr-shell"') library(SparkR) sc=sparkR.init(master="local",sparkHome =

Re: Date / time stuff with spark.

2016-01-21 Thread Andrew Holway
P.S. We are working with Python. On Thu, Jan 21, 2016 at 8:24 PM, Andrew Holway wrote: > Hello, > > I am importing this data from HDFS into a data frame with > sqlContext.read.json(). > > {“a": 42, “a": 56, "Id": "621368e2f829f230", “smunkId": >

Re: process of executing a program in a distributed environment without hadoop

2016-01-21 Thread nsalian
Thanks for the question. The documentation here: https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit lists a variety of submission techniques. You can vary the master URL to suit your needs, whether it be local, YARN or Mesos. -