Re: How to limit search range without using subquery when query SQL DB via JDBC?

2016-05-13 Thread Mich Talebzadeh
Well I don't know about postgres but you can limit the number of columns and rows fetched via JDBC at source rather than loading and filtering them in Spark: val c = HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
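
A pyspark sketch of the same pushdown idea for the Postgres case in this thread; the connection details, driver and column names are placeholders, and the subquery in dbtable is executed on the database side:

    df = sqlContext.read.format("jdbc").options(
        url="jdbc:postgresql://dbhost:5432/mydb",        # placeholder URL
        dbtable="(SELECT channel_id, channel_desc FROM channels) AS pushed",
        driver="org.postgresql.Driver"
    ).load()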

Re: Pyspark accumulator

2016-05-13 Thread Abi
On Tue, May 10, 2016 at 2:24 PM, Abi wrote: > 1. How come pyspark does not provide the localValue function like Scala? > > 2. Why is pyspark more restrictive than Scala?

Re: pyspark mapPartitions()

2016-05-13 Thread Abi
On Tue, May 10, 2016 at 2:20 PM, Abi wrote: > Is there any example of this? I want to see how you write the > iterable example

How to limit search range without using subquery when query SQL DB via JDBC?

2016-05-13 Thread Jyun-Fan Tsai
I'm trying to load some rows from a big SQL table. Here is my code:

    jdbcDF = sqlContext.read.format("jdbc").options(
        url="jdbc:postgresql://...",
        dbtable="mytable",
        partitionColumn="t",
        lowerBound=1451577600,
        upperBound=1454256000,
        numPartitions=1).load()
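
Note that partitionColumn/lowerBound/upperBound only control how the reads are split into partitions; they do not restrict which rows are fetched. A sketch of one way to limit the range without a subquery, reusing jdbcDF from above — a simple comparison filter should be pushed down to the database as a WHERE clause by the 1.x JDBC source, which is worth confirming with explain():

    # lowerBound/upperBound split the partition column across tasks; they do not filter rows.
    ranged = jdbcDF.filter((jdbcDF.t >= 1451577600) & (jdbcDF.t < 1454256000))
    ranged.explain()   # check that the filter shows up as pushed down to the JDBC source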

Spark 1.4.1 + Kafka 0.8.2 with Kerberos

2016-05-13 Thread Mail.com
Hi All, I am trying to get Spark 1.4.1 (Java) to work with Kafka 0.8.2 in a Kerberos-enabled cluster (HDP 2.3.2). Is there any document I can refer to? Thanks, Pradeep

support for golang

2016-05-13 Thread Sourav Chakraborty
Folks, I was curious to find out if anybody has ever considered/attempted to support Golang with Spark. -Thanks Sourav

broadcast variable not picked up

2016-05-13 Thread abi
def kernel(arg):
    input = broadcast_var.value + 1
    # some processing with input

def foo():
    broadcast_var = sc.broadcast(var)
    rdd.foreach(kernel)

def main():
    # something

In this code, I get the following error: NameError: global name 'broadcast_var' is not defined
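
The NameError is consistent with broadcast_var being local to foo(), so kernel never sees that name. A minimal sketch of one way to restructure it, binding the broadcast handle through a closure (sc, rdd and var are assumed to exist as in the snippet):

    def make_kernel(broadcast_var):
        def kernel(arg):
            value = broadcast_var.value + 1   # the handle is captured by the closure
            # ... some processing with value ...
        return kernel

    def foo():
        broadcast_var = sc.broadcast(var)
        rdd.foreach(make_kernel(broadcast_var))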

Re: System memory 186646528 must be at least 4.718592E8.

2016-05-13 Thread satish saley
Thank you. Looking at the source code helped :) I set spark.testing.memory to 512 MB and it worked :)

    private def getMaxMemory(conf: SparkConf): Long = {
      val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
      val reservedMemory =
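
A minimal sketch of the workaround described above; the value is in bytes, and per the snippet it takes the place of Runtime.getRuntime.maxMemory in the check, so raising --driver-memory is the other way to satisfy it in local mode:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local[*]")
            .set("spark.testing.memory", str(512 * 1024 * 1024)))   # 512 MB, in bytes
    sc = SparkContext(conf=conf)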

Re: strange behavior when I chain data frame transformations

2016-05-13 Thread Andy Davidson
Hi Ted, It seems really strange. It seems like in the version where I used 2 data frames Spark added "as(tag)". (Which is really nice.) Odd that I got different behavior. Is this a bug? Kind regards Andy From: Ted Yu Date: Friday, May 13, 2016 at 12:38 PM To:

Re: SQLContext and HiveContext parse a query string differently ?

2016-05-13 Thread Hao Ren
Basically, I want to run the following query: select 'a\'b', case(null as Array) However, neither HiveContext nor SQLContext can execute it without an exception. I have tried sql(select 'a\'b', case(null as Array)) and df.selectExpr("'a\'b'", "case(null as Array)"). Neither of them works.

Re: System memory 186646528 must be at least 4.718592E8.

2016-05-13 Thread Ted Yu
Here is related code:

    val executorMemory = conf.getSizeAsBytes("spark.executor.memory")
    if (executorMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"Executor memory $executorMemory must be at least " +

On Fri, May 13, 2016 at 12:47 PM, satish saley

Spark job fails when using checkpointing if a class change in the job

2016-05-13 Thread map reduced
Hi, I have my application jar sitting in HDFS which defines a long-running Spark Streaming job, and I am using a checkpoint dir that is also in HDFS. Every time I have any changes to the job, I delete that jar and upload a new one. Now if I upload a new jar and delete the checkpoint directory, it works fine.

System memory 186646528 must be at least 4.718592E8.

2016-05-13 Thread satish saley
Hello, I am running the https://github.com/apache/spark/blob/branch-1.6/examples/src/main/python/pi.py example, but am facing the following exception. What is the unit of memory pointed out in the error? Following are the configs: --master local[*]
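
The numbers in the message are bytes. A quick arithmetic check (the 450 MB floor appears to be 1.5 x the 300 MB of reserved system memory in Spark 1.6's unified memory manager):

    print(186646528 / (1024.0 ** 2))   # 178.0 MB -- what the JVM reported as available
    print(4.718592e8 / (1024.0 ** 2))  # 450.0 MB -- the minimum Spark insists on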

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Ok, so that worked flawlessly after I upped the number of partitions to 400 from 40. Thanks! On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung wrote: > I'll try that, as of now I have a small number of partitions in the order > of 20~40. > > It would be great if

Re: strange behavior when I chain data frame transformations

2016-05-13 Thread Ted Yu
In the structure shown, tag is under element. I wonder if that was a factor. On Fri, May 13, 2016 at 11:49 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using spark-1.6.1. > > I create a data frame from a very complicated JSON file. I would assume > that query planer would

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
I'll try that; as of now I have a small number of partitions, on the order of 20~40. It would be great if there were some documentation on the memory requirement with respect to the number of keys and the number of partitions per executor (i.e., Spark's internal memory requirement outside of the user space).

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Ted Yu
Have you taken a look at SPARK-11293 ? Consider using repartition to increase the number of partitions. FYI On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung wrote: > Hello, > > I'm using Spark version 1.6.0 and have trouble with memory when trying to > do
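
A pyspark sketch of the suggestion (pairs stands for the (key, value) RDD from the original post; 400 is the partition count reported to work elsewhere in the thread):

    # Give reduceByKey many more output partitions so each task handles fewer keys.
    counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=400)

    # Equivalent idea via an explicit repartition of the input first:
    counts2 = pairs.repartition(400).reduceByKey(lambda a, b: a + b)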

Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Hello, I'm using Spark version 1.6.0 and have trouble with memory when trying to do reduceByKey on a dataset with as many as 75 million keys, i.e., I get the following exception when I run the task. There are 20 workers in the cluster. It is running under standalone mode with 12 GB assigned

API to study key cardinality and distribution and other important statistics about data at certain stage

2016-05-13 Thread Nirav Patel
Hi, the problem is that every time a job fails or performs poorly at certain stages, you need to study your data distribution just before THAT stage. An overall look at the input data set doesn't help very much if you have so many transformations going on in the DAG. I always end up writing complicated typed code to run
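
A small pyspark sketch of the kind of ad-hoc check being described — sampling the RDD just before the suspect stage and looking at the heaviest keys (rdd is assumed to be a pair RDD; the sample fraction is arbitrary):

    from operator import add

    key_counts = (rdd.sample(False, 0.01)            # keep the check cheap
                     .map(lambda kv: (kv[0], 1))
                     .reduceByKey(add))
    print(key_counts.top(20, key=lambda kv: kv[1]))  # the 20 most frequent keys in the sample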

strange behavior when I chain data frame transformations

2016-05-13 Thread Andy Davidson
I am using spark-1.6.1. I create a data frame from a very complicated JSON file. I would assume that the query planner would treat both versions of my transformation chains the same way. // org.apache.spark.sql.AnalysisException: Cannot resolve column name "tag" among (actor, body, generator, pip,
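
A pyspark sketch of the two-step version that the thread reports working; the schema and field names here are made up, the point is only that the exploded column is aliased in its own select before the next transformation refers to it:

    from pyspark.sql.functions import col, explode

    exploded = df.select(col("id"), explode(col("object.tags")).alias("tag"))
    filtered = exploded.where(col("tag") == "something")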

Re: Tracking / estimating job progress

2016-05-13 Thread Dood
On 5/13/2016 10:39 AM, Anthony May wrote: It looks like it might only be available via REST, http://spark.apache.org/docs/latest/monitoring.html#rest-api Nice, thanks! On Fri, 13 May 2016 at 11:24 Dood@ODDO wrote: On 5/13/2016

Re: Tracking / estimating job progress

2016-05-13 Thread Anthony May
It looks like it might only be available via REST, http://spark.apache.org/docs/latest/monitoring.html#rest-api On Fri, 13 May 2016 at 11:24 Dood@ODDO wrote: > On 5/13/2016 10:16 AM, Anthony May wrote:

Re: Tracking / estimating job progress

2016-05-13 Thread Dood
On 5/13/2016 10:16 AM, Anthony May wrote: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker Might be useful How do you use it? You cannot instantiate the class - is the constructor private? Thanks! On Fri, 13 May 2016 at 11:11 Ted Yu
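
The tracker is not constructed directly; it is obtained from a running SparkContext (sc.statusTracker in Scala). A pyspark sketch of polling it — the Python StatusTracker mirrors the Scala SparkStatusTracker methods, and completed tasks out of total tasks per stage gives a rough progress estimate:

    tracker = sc.statusTracker()                 # obtained from the context, not instantiated

    for job_id in tracker.getActiveJobsIds():
        job = tracker.getJobInfo(job_id)
        if job is None:
            continue
        for stage_id in job.stageIds:
            stage = tracker.getStageInfo(stage_id)
            if stage:
                print(stage_id, stage.numCompletedTasks, "of", stage.numTasks, "tasks done")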

Re: Tracking / estimating job progress

2016-05-13 Thread Anthony May
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker Might be useful On Fri, 13 May 2016 at 11:11 Ted Yu wrote: > Have you looked > at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ? > > Cheers > > On Fri,

Re: Tracking / estimating job progress

2016-05-13 Thread Ted Yu
Have you looked at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ? Cheers On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO wrote: > I provide a RESTful API interface from scalatra for launching Spark jobs - > part of the functionality is tracking these

Tracking / estimating job progress

2016-05-13 Thread Dood
I provide a RESTful API interface from scalatra for launching Spark jobs - part of the functionality is tracking these jobs. What API is available to track the progress of a particular spark application? How about estimating where in the total job progress the job is? Thanks!

pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-13 Thread abi
The pandas dataframe is broadcast successfully, but it gives errors in the datanode function called kernel. Code:

    dataframe_broadcast = sc.broadcast(dataframe)

    def kernel():
        df_v = dataframe_broadcast.value

Error: I get this error when I try accessing the value member of the broadcast variable.
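
The error itself is cut off above, so this is just a minimal sketch of the pattern working end to end, assuming the pandas DataFrame pickles cleanly and pandas is installed on the executors:

    import pandas as pd

    pdf = pd.DataFrame({"x": [1, 2, 3]})
    dataframe_broadcast = sc.broadcast(pdf)          # the DataFrame must be picklable

    def kernel(partition):
        df_v = dataframe_broadcast.value             # deserialized on the executor
        total = df_v["x"].sum()
        return [row + total for row in partition]

    print(sc.parallelize([10, 20, 30], 2).mapPartitions(kernel).collect())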

Re: How to get and save core dump of native library in executors

2016-05-13 Thread prateek arora
I am running my cluster on Ubuntu 14.04 Regards Prateek -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-and-save-core-dump-of-native-library-in-executors-tp26945p26952.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Creating Nested dataframe from flat data.

2016-05-13 Thread Prashant Bhardwaj
Thank you. That's exactly what I was looking for. Regards Prashant On Fri, May 13, 2016 at 9:38 PM, Xinh Huynh wrote: > Hi Prashant, > > You can create struct columns using the struct() function in > org.apache.spark.sql.functions -- > >

Re: Spark 2.0.0-snapshot: IllegalArgumentException: requirement failed: chunks must be non-empty

2016-05-13 Thread Ted Yu
Is it possible to come up with a code snippet which reproduces the following? Thanks On Fri, May 13, 2016 at 8:13 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > I am able to run my application after I compiled Spark source in the > following way > > ./dev/change-scala-version.sh

Re: Creating Nested dataframe from flat data.

2016-05-13 Thread Xinh Huynh
Hi Prashant, You can create struct columns using the struct() function in org.apache.spark.sql.functions -- http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ val df = sc.parallelize(List(("a", "b", "c"))).toDF("A", "B", "C") import
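
A pyspark rendering of the same approach, using the column names from the original question further down; struct() groups existing columns into a nested column:

    from pyspark.sql.functions import col, struct

    flat = sqlContext.createDataFrame(
        [("1", "2", "3", "4", "5", "6")], ["a", "b", "c", "d", "e", "f"])

    nested = flat.select(
        struct(col("a"), col("b")).alias("nested_obj1"),
        struct(col("c"), col("d")).alias("nested_obj2"),
        struct(col("e"), col("f")).alias("nested_obj3"))
    nested.printSchema()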

Re: Joining a RDD to a Dataframe

2016-05-13 Thread Xinh Huynh
Hi Cyril, In the case where there are no documents, it looks like there is a typo in "addresses" (check the number of "d"s): | scala> df.select(explode(df("addresses.id")).as("aid"), df("id")) <== addresses | org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among

Spark 2.0.0-snapshot: IllegalArgumentException: requirement failed: chunks must be non-empty

2016-05-13 Thread Raghava Mutharaju
I am able to run my application after I compiled Spark source in the following way ./dev/change-scala-version.sh 2.11 ./dev/make-distribution.sh --name spark-2.0.0-snapshot-bin-hadoop2.6 --tgz -Phadoop-2.6 -DskipTests But while the application is running I get the following exception, which I

Re: Kafka partition increased while Spark Streaming is running

2016-05-13 Thread chandan prakash
Makes sense. Thank you, Cody. Regards, Chandan On Fri, May 13, 2016 at 8:10 PM, Cody Koeninger wrote: > No, I wouldn't expect it to, once the stream is defined (at least for > the direct stream integration for kafka 0.8), the topicpartitions are > fixed. > > My answer to any

memory leak exception

2016-05-13 Thread Imran Akbar
I'm trying to save a table using this code in pyspark with 1.6.1: prices = sqlContext.sql("SELECT AVG(amount) AS mean_price, country FROM src GROUP BY country") prices.collect() prices.write.saveAsTable('prices', format='parquet', mode='overwrite', path='/mnt/bigdisk/tables') but I'm getting

Re: Kafka partition increased while Spark Streaming is running

2016-05-13 Thread Cody Koeninger
No, I wouldn't expect it to; once the stream is defined (at least for the direct stream integration for Kafka 0.8), the topicpartitions are fixed. My answer to any question about "but what if checkpoints don't let me do this" is always going to be "well, don't rely on checkpoints." If you want

Re: The metastore database gives errors when start spark-sql CLI.

2016-05-13 Thread Mich Talebzadeh
I do not know Postgres but that sounds like a system table, much like Oracle's v$instance. Why should running a Hive schema script against a Hive schema/DB in Postgres impact the system schema? Mine is Oracle: s...@mydb12.mich.LOCAL> SELECT version FROM v$instance; VERSION - 12.1.0.2.0

Re: Confused - returning RDDs from functions

2016-05-13 Thread Dood
On 5/12/2016 10:01 PM, Holden Karau wrote: This is not the expected behavior, can you maybe post the code where you are running into this? Hello, thanks for replying! Below is the function I took out from the code. def converter(rdd:

SparkSql Catalyst extending Analyzer, Error with CatalystConf

2016-05-13 Thread sib
Hello, I am trying to write a basic analyzer by extending the Catalyst analyzer with a few extra rules. I am getting the following error: """ trait CatalystConf in package catalyst cannot be accessed in package org.apache.spark.sql.catalyst """ In my attempt I am doing the following: class

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount examlpe

2016-05-13 Thread Amit Sela
Taking it to a more basic level, I compared a simple transformation with RDDs and with Datasets. This is far simpler than Renato's use case and it brings up two good questions: 1. Is the time it takes to "spin up" a standalone instance of Spark(SQL) just an additional one-time overhead

SparkSql Catalyst extending Analyzer, Error with CatalystConf

2016-05-13 Thread Alexander Sibetheros
Hello, I am trying to write a basic analyzer by extending the Catalyst analyzer with a few extra rules. I am getting the following error: """ trait CatalystConf in package catalyst cannot be accessed in package org.apache.spark.sql.catalyst """ In my attempt I am doing the following: class

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Tom Godden
I corrected the type to RDD, but it's still giving me the error. I believe I have found the reason though. The vals variable is created using the map procedure on some other RDD. Although it is declared as a JavaRDD, the classTag it returns is Object. I think that because

Creating Nested dataframe from flat data.

2016-05-13 Thread Prashant Bhardwaj
Hi, let's say I have a flat dataframe with 6 columns like: { "a": "somevalue", "b": "somevalue", "c": "somevalue", "d": "somevalue", "e": "somevalue", "f": "somevalue" } Now I want to convert this dataframe to contain nested columns like { "nested_obj1": { "a": "somevalue", "b": "somevalue" },

Re: ANOVA test in Spark

2016-05-13 Thread mylisttech
Mayank, assuming ANOVA is not present in MLlib, can you not exploit the ANOVA from SparkR? I am enquiring, not making a factual statement. Thanks On May 13, 2016, at 15:54, mayankshete wrote: > Is ANOVA present in Spark Mllib if not then, when will be this feature be >

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Sean Owen
The Java docs won't help since they only show "Object", yes. Have a look at the Scala docs: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions An RDD of T produces an RDD of T[]. On Fri, May 13, 2016 at 12:10 PM, Tom Godden

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Tom Godden
I assumed the "fixed size blocks" mentioned in the documentation (https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/mllib/rdd/RDDFunctions.html#sliding%28int,%20int%29) were RDDs, but I guess they're arrays? Even when I change the RDD to arrays (so it looks like RDD), it

Re: Graceful shutdown of spark streaming on yarn

2016-05-13 Thread Rakesh H (Marketing Platform-BLR)
Have you used awaitTermination() on your ssc? --> Yes, I have used that. Also try setting the deployment mode to yarn-client. --> Is this not supported in yarn-cluster mode? I am trying to find the root cause for yarn-cluster mode. Have you tested graceful shutdown in yarn-cluster mode? On Fri, May

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Sean Owen
I'm not sure what you're trying there. The return type is an RDD of arrays, not of RDDs or of ArrayLists. There may be another catch but that is not it. On Fri, May 13, 2016 at 11:50 AM, Tom Godden wrote: > I believe it's an illegal cast. This is the line of code: >>

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Tom Godden
I believe it's an illegal cast. This is the line of code: > RDD> windowed = > RDDFunctions.fromRDD(vals.rdd(), vals.classTag()).sliding(20, 1); with vals being a JavaRDD. Explicitly casting doesn't work either: > RDD> windowed = (RDD>) >

Re: sbt for Spark build with Scala 2.11

2016-05-13 Thread Raghava Mutharaju
Thank you for the response. I used the following command to build from source: build/mvn -Dhadoop.version=2.6.4 -Phadoop-2.6 -DskipTests clean package Would this put the required jars in .ivy2 during the build process? If so, how can I make the Spark distribution runnable, so that I can use

ANOVA test in Spark

2016-05-13 Thread mayankshete
Is ANOVA present in Spark MLlib? If not, when will this feature be available in Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ANOVA-test-in-Spark-tp26949.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: S3A Creating Task Per Byte (pyspark / 1.6.1)

2016-05-13 Thread Steve Loughran
On 12 May 2016, at 18:35, Aaron Jackson wrote: I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in S3. I've done this previously with Spark 1.5 with no issue. Attempting to load and count a single file as follows:

Re: Kafka partition increased while Spark Streaming is running

2016-05-13 Thread chandan prakash
Follow-up question: if Spark Streaming is using checkpointing (/tmp/checkpointDir) for at-least-once semantics and the number of topics and/or partitions has increased, will gracefully shutting down and restarting from the checkpoint consider the new topics and/or partitions? If the answer is NO

Re: High virtual memory consumption on spark-submit client.

2016-05-13 Thread jone
No, I have set master to yarn-cluster. When SparkPi is running, the result of free -t is as follows:

    [running]mqq@10.205.3.29:/data/home/hive/conf$ free -t
                 total       used       free     shared    buffers     cached
    Mem:      32740732   32105684     635048          0     683332

The metastore database gives errors when start spark-sql CLI.

2016-05-13 Thread Joseph
Hi all, I use PostgreSQL to store the Hive metadata. First, I imported a SQL script into the metastore database as follows: psql -U postgres -d metastore -h 192.168.50.30 -f hive-schema-1.2.0.postgres.sql Then, when I started $SPARK_HOME/bin/spark-sql, PostgreSQL gave the following

Re: Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread Sean Owen
The problem is there's no Java-friendly version of this, and the Scala API return type actually has no analog in Java (an array of any type, not just of objects) so it becomes Object. You can just cast it to the type you know it will be -- RDD or RDD or whatever. On Fri, May 13,

Java: Return type of RDDFunctions.sliding(int, int)

2016-05-13 Thread tgodden
Hello, We're trying to use PrefixSpan on sequential data, by passing a sliding window over it. Spark Streaming is not an option. RDDFunctions.sliding() returns an item of class RDD, regardless of the original type of the RDD. Because of this, the returned item seems to be pretty much worthless.

When start spark-sql, postgresql gives errors.

2016-05-13 Thread Joseph
Hi all, I use PostgreSQL to store the Hive metadata. First, I imported a SQL script into the metastore database as follows: psql -U postgres -d metastore -h 192.168.50.30 -f hive-schema-1.2.0.postgres.sql Then, when I started $SPARK_HOME/bin/spark-sql, PostgreSQL gave the following

Re: Spark handling spill overs

2016-05-13 Thread Mich Talebzadeh
Spill-overs are a common issue for in-memory computing systems; after all, memory is limited. In Spark, where RDDs are immutable, if an RDD is created with a size greater than half of a node's RAM, then a transformation generating the consequent RDD can potentially fill all of the node's memory, which can
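
A common mitigation (not from the thread, just a sketch) is to persist the large intermediate RDD with a disk-backed storage level so Spark spills partitions instead of running out of memory; heavy_transform is a placeholder:

    from pyspark import StorageLevel

    transformed = rdd.map(heavy_transform)
    transformed.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk when memory runs out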

Re: Graceful shutdown of spark streaming on yarn

2016-05-13 Thread Deepak Sharma
Rakesh, have you used awaitTermination() on your ssc? If not, add this and see if it changes the behavior. I am guessing this issue may be related to the yarn deployment mode. Also try setting the deployment mode to yarn-client. Thanks Deepak On Fri, May 13, 2016 at 10:17 AM, Rakesh H (Marketing
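
A pyspark sketch of the pattern under discussion; parameter names follow the 1.x Python API, and the spark.streaming.stopGracefullyOnShutdown setting is worth checking against your Spark version:

    ssc.start()
    ssc.awaitTermination()        # block the driver until the context is stopped

    # From a shutdown hook or control thread, stop gracefully so in-flight
    # batches are processed before the context goes away:
    ssc.stop(stopSparkContext=True, stopGraceFully=True)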