Re: Dynamic partition pruning

2015-10-16 Thread Michael Armbrust
We don't support dynamic partition pruning yet. On Fri, Oct 16, 2015 at 10:20 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all > > > > I’m running sqls on spark 1.5.1 and using tables based on parquets. > > My tables are not pruned when joined on partition columns. > > Ex: >

PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
I'm trying to create a DStream of DataFrames using PySpark. I receive data from Kafka in the form of a JSON string, and I'm parsing these RDDs of Strings into DataFrames. My code is: I get the following error at pyspark/streaming/util.py, line 64: I've verified that the sqlContext is properly

Re: Turn off logs in spark-sql shell

2015-10-16 Thread Jakob Odersky
[repost to mailing list, ok I gotta really start hitting that reply-to-all-button] Hi, Spark uses Log4j which unfortunately does not support fine-grained configuration over the command line. Therefore some configuration file editing will have to be done (unless you want to configure Loggers
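A minimal sketch of that programmatic route (assuming the default Log4j backend is in use; the logger names below are just the usual noisy ones, not anything mandated):

  import org.apache.log4j.{Level, Logger}

  // raise the chatty Spark/Hadoop loggers so the spark-sql shell stays quiet
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.apache.hadoop").setLevel(Level.ERROR)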

Dynamic partition pruning

2015-10-16 Thread Younes Naguib
Hi all, I'm running SQL queries on Spark 1.5.1 against tables backed by Parquet files. My tables are not pruned when joined on partition columns. Ex: Select from tab where partcol=1 will prune on value 1; Select from tab join dim on (dim.partcol=tab.partcol) where dim.partcol=1 will scan all
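A minimal sketch of the two query shapes described above (hypothetical table and column names), plus the usual manual workaround on Spark 1.5 of repeating the literal predicate on the fact table so static pruning can kick in:

  // literal predicate on the partition column: partitions are pruned
  sqlContext.sql("SELECT * FROM tab WHERE partcol = 1")

  // predicate only on the joined dimension: every partition of tab is scanned;
  // repeating the literal on the fact table restores static pruning
  sqlContext.sql(
    """SELECT * FROM tab JOIN dim ON dim.partcol = tab.partcol
      |WHERE dim.partcol = 1 AND tab.partcol = 1""".stripMargin)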

Question of RDD in calculation

2015-10-16 Thread Shepherd
Hi all, I am new to Spark, and I have a question about dealing with RDDs. I've converted RDDs to DataFrames, so there are two DFs: DF1 and DF2. DF1 contains: userID, time, dataUsage, duration. DF2 contains: userID. Each userID has multiple rows in DF1. DF2 has distinct userIDs, and I would like to compute the

Re: PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
Hi Ken, thanks for replying. Unless I'm misunderstanding something, I don't believe that's correct. DStream.transform() accepts a single argument, func. func should be a function that accepts a single RDD, and returns a single RDD. That's what transform_to_df does, except the RDD it returns is a

Problems w/YARN Spark Streaming app reading from Kafka

2015-10-16 Thread Robert Towne
I have a Spark Streaming app that reads using a receiver-less connection (KafkaUtils.createDirectStream) with an interval of 1 minute. For about 15 hours it was running fine, with input sizes ranging from 3,861,758 to 16,836 events. Then about 3 hours ago, every minute batch brought in the same

Problem of RDD in calculation

2015-10-16 Thread ChengBo
Hi all, I am new to Spark, and I have a question about dealing with RDDs. I've converted RDDs to DataFrames, so there are two DFs: DF1 and DF2. DF1 contains: userID, time, dataUsage, duration. DF2 contains: userID. Each userID has multiple rows in DF1. DF2 has distinct userIDs, and I would like to compute

Re: Convert SchemaRDD to RDD

2015-10-16 Thread Ted Yu
bq. type mismatch found String required Serializable See line 110: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#109 Can you pastebin the complete stack trace for the error you encountered ? Cheers On Fri, Oct 16, 2015 at 8:01 AM, satish

Re: Problems w/YARN Spark Streaming app reading from Kafka

2015-10-16 Thread Cody Koeninger
What do you mean by "the current documentation states it isn’t used"? http://spark.apache.org/docs/latest/configuration.html still lists the value and its meaning. As far as the issue you're seeing, are you measuring records by looking at logs, the spark ui, or actual downstream sinks of data?

RE: Dynamic partition pruning

2015-10-16 Thread Younes Naguib
Thanks, Do you have a Jira I can follow for this? y From: Michael Armbrust [mailto:mich...@databricks.com] Sent: October-16-15 2:18 PM To: Younes Naguib Cc: user@spark.apache.org Subject: Re: Dynamic partition pruning We don't support dynamic partition pruning yet. On Fri, Oct 16, 2015 at

RE: Spark SQL running totals

2015-10-16 Thread Stefan Panayotov
Thanks Deenar. This works perfectly. I can't test the solution with window functions because I am still on Spark 1.3.1. Hopefully we will move to 1.5 soon. Stefan Panayotov Sent from my Windows Phone From: Deenar Toraskar Sent:
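For reference once on 1.5, a minimal sketch of the window-function approach to a running total (hypothetical column names; assuming a DataFrame df with account, date and amount columns):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // running total per account, ordered by date, from the first row up to the current row
  val w = Window.partitionBy("account").orderBy("date").rowsBetween(Long.MinValue, 0)
  val withRunningTotal = df.withColumn("running_total", sum(col("amount")).over(w))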

Clustering KMeans error in 1.5.1

2015-10-16 Thread robin_up
We upgraded from 1.4.0 to 1.5.1 (skipped 1.5.0) and one of our clustering jobs hit the below error. Does anyone know what this is about or if it is a bug? stdout4260Traceback (most recent call last): File "user_clustering.py", line 137, in uig_model = KMeans.train(uigs,i,nIter, runs =

How to speed up reading from file?

2015-10-16 Thread Saif.A.Ellafi
Hello, Is there an optimal number of partitions per number of rows when writing to disk, so we can later re-read from the source in a distributed way? Any thoughts? Thanks Saif

In-memory computing and cache() in Spark

2015-10-16 Thread Jia Zhan
Hi all, I am running Spark locally in one node and trying to sweep the memory size for performance tuning. The machine has 8 CPUs and 16G main memory, the dataset in my local disk is about 10GB. I have several quick questions and appreciate any comments. 1. Spark performs in-memory computing,

Streaming of COAP Resources

2015-10-16 Thread Sadaf
I am currently working on the IoT CoAP protocol. I accessed the server on localhost through the Copper Firefox plugin. Then I added a resource having "GET" functionality on the server. After that I made its client a streaming source. Here is the code of the client streaming class: customReceiver(test:String) extends
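A minimal sketch of a custom receiver wrapping a polling client (the CoAP client call itself is omitted, and the class and field names here are hypothetical):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class CoapReceiver(resourceUrl: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      // poll on a separate thread so onStart returns immediately
      new Thread("CoAP Receiver") { override def run(): Unit = receive() }.start()
    }

    def onStop(): Unit = {}  // nothing to clean up in this sketch

    private def receive(): Unit = {
      while (!isStopped()) {
        val payload = "..."   // replace with the CoAP GET call against resourceUrl
        store(payload)        // hand the record to Spark Streaming
        Thread.sleep(1000)
      }
    }
  }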

Re: Best practices to handle corrupted records

2015-10-16 Thread Erwan ALLAIN
Either[FailureResult[T], Either[SuccessWithWarnings[T], SuccessResult[T]]] maybe ? On Thu, Oct 15, 2015 at 5:31 PM, Antonio Murgia < antonio.murg...@studio.unibo.it> wrote: > 'Either' does not cover the case where the outcome was successful but > generated warnings. I already looked into it and

Re: Get the previous state string in Spark streaming

2015-10-16 Thread Tathagata Das
It's hard to help without any stack trace associated with the UnsupportedOperationException. On Thu, Oct 15, 2015 at 10:40 PM, Chandra Mohan, Ananda Vel Murugan < ananda.muru...@honeywell.com> wrote: > One of my co-workers (Yogesh) was trying to get this posted in spark mailing > and it seems it did not

Re: Best practices to handle corrupted records

2015-10-16 Thread Ravindra
+1 Erwan. Maybe a trivial solution like this: class Result (msg: String, record: Record) class Success (msgSuccess: String, val msg: String, val record: Record) extends Result(msg, record) class Failure (msgFailure: String, val msg: String, val record: Record) extends Result (msg, record)

Ensuring eager evaluation inside mapPartitions

2015-10-16 Thread alberskib
Hi all, I am wondering whether there is a way to ensure that two consecutive maps inside mapPartitions will not be chained together. To illustrate my question I prepared a short example: rdd.mapPartitions(it => { it.map(x => foo(x)).map(y => y.getResult) }) I would like to ensure that foo

RE: Get the previous state string in Spark streaming

2015-10-16 Thread Chandra Mohan, Ananda Vel Murugan
Hi, thanks for the response. We are trying to implement something similar to what is discussed in the following SO post: http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation We are doing it in Java, while the accepted answer (second answer) in this post

Re: s3a file system and spark deployment mode

2015-10-16 Thread Steve Loughran
> On 15 Oct 2015, at 19:04, Scott Reynolds wrote: > > List, > > Right now we build our spark jobs with the s3a hadoop client. We do this > because our machines are only allowed to use IAM access to the s3 store. We > can build our jars with the s3a filesystem and the

Re: Best practices to handle corrupted records

2015-10-16 Thread Antonio Murgia
Unfortunately Either doesn’t accept 3 type parameters, only 2, so the Either solution is not viable. My solution is pretty similar to Ravindra’s. This “post” was to find out if there was a common and established solution to this problem in the Spark “world”. On Oct 16, 2015, at 11:05 AM,
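A minimal sketch of the three-outcome ADT being discussed (a sealed trait instead of nested Eithers; all names here are hypothetical):

  sealed trait ParseResult[+T]
  case class Success[T](record: T) extends ParseResult[T]
  case class SuccessWithWarnings[T](record: T, warnings: Seq[String]) extends ParseResult[T]
  case class Failure(raw: String, error: String) extends ParseResult[Nothing]

  // downstream, pattern matching keeps the three cases explicit, e.g.
  // parsed.flatMap { case Success(r) => Some(r); case SuccessWithWarnings(r, _) => Some(r); case _: Failure => None }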

Re: Get the previous state string in Spark streaming

2015-10-16 Thread Tathagata Das
A simple Javadoc lookup for java.util.List shows that List.add can throw UnsupportedOperationException if the implementation of the List interface does not support that operation (for example, one that does not support adding null). A good place
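A minimal reproduction of that failure mode (my own example, not from the original thread): Arrays.asList returns a fixed-size list, so add() throws; copying into a growable implementation avoids it.

  val fixed = java.util.Arrays.asList("a", "b")
  // fixed.add("c")                              // throws UnsupportedOperationException

  val growable = new java.util.ArrayList[String](fixed)
  growable.add("c")                              // ok: ArrayList supports add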

Re: Ensuring eager evaluation inside mapPartitions

2015-10-16 Thread Sean Owen
If you mean, getResult is called on the result of foo for each record, then that already happens. If you mean getResults is called only after foo has been called on all records, then you have to collect to a list, yes. Why does it help with foo being slow in either case though? You can try to

Re: Does Spark use more memory than MapReduce?

2015-10-16 Thread Gylfi
By default Spark will actually not keep the data at all; it will just store "how" to recreate the data. The programmer can, however, choose to keep the data once instantiated by calling .persist() or .cache() on the RDD. .cache() will store the data in-memory only and fail if it will not
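A minimal sketch of the choice being described (hypothetical input path and parse function):

  import org.apache.spark.storage.StorageLevel

  val lines  = sc.textFile("hdfs:///data/input")      // ~10GB input in the original question
  val parsed = lines.map(parseRecord)                 // parseRecord is a placeholder

  parsed.persist(StorageLevel.MEMORY_AND_DISK)        // spill partitions to disk when memory is short
  // parsed.cache() would be equivalent to persist(StorageLevel.MEMORY_ONLY)

  parsed.count()                                      // first action materializes the cached data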

Re: Ensuring eager evaluation inside mapPartitions

2015-10-16 Thread Bartłomiej Alberski
I mean getResults is called only after foo has been called on all records. It could be useful if foo is an asynchronous call to an external service returning a Future that provides some additional data, e.g. a REST API (IO operations). If such an API has a latency of 100ms, sending all requests (for 1000
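A minimal sketch of that batching idea (callExternalService and getResult are hypothetical names; assuming the service call returns a Future and the timeout suits the API):

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  rdd.mapPartitions { it =>
    it.grouped(100).flatMap { batch =>
      // fire 100 async requests, wait for all of them, then move on to the next batch
      val futures = batch.map(x => callExternalService(x))
      Await.result(Future.sequence(futures), 30.seconds).map(_.getResult)
    }
  }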

HBase Spark Streaming giving error after restore

2015-10-16 Thread Amit Singh Hora
Hi All, I am using the below code to stream data from Kafka to HBase. Everything works fine until I restart the job so that it can restore the state from the checkpoint directory, but while trying to restore the state it gives me the below error ge 0.0 (TID 0, localhost): java.lang.ClassCastException:

issue of tableau connect to spark sql 1.5

2015-10-16 Thread Wangfei (X)
Hi all! I tested Tableau (9.1.0, 32-bit) reading tables from Spark SQL (built from branch-1.5) over ODBC and found the following issue: # # "[Simba][SQLEngine] (31740) Table or view not found: SPARK.default.src # table "[default].[src]" not exist" and I found a very strange issue: if I

Convert SchemaRDD to RDD

2015-10-16 Thread satish chandra j
Hi All, To convert a SchemaRDD to an RDD, the below snippet works if the SQL statement has fewer than 22 columns in a row, per the tuple restriction: rdd.map(row => row.toString) But if the SQL statement has more than 22 columns, the above snippet errors with "*object Tuple27 is not a member of package

Issue of jar dependency in yarn-cluster mode

2015-10-16 Thread Rex Xiong
Hi folks, In my Spark application, the executor task depends on snakeyaml-1.10.jar. I build it with Maven and it works fine: spark-submit --master local --jars d:\snakeyaml-1.10.jar ... But when I try to run it in YARN, I have an issue; it seems the Spark executor cannot find the jar file:

Re: HBase Spark Streaming giving error after restore

2015-10-16 Thread Ted Yu
Can you show the complete stack trace ? Subclass of Mutation is expected. Put is a subclass. Have you tried replacing BoxedUnit with Put in your code ? Cheers On Fri, Oct 16, 2015 at 6:02 AM, Amit Singh Hora wrote: > Hi All, > > I am using below code to stream data from
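A sketch of the shape Ted is pointing at, assuming the common saveAsNewAPIHadoopDataset pattern (the column family, qualifier and hbaseConf names below are hypothetical): the pair RDD written to HBase must carry a Mutation such as a Put, not Unit.

  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.util.Bytes

  def toPut(key: String, value: String): (ImmutableBytesWritable, Put) = {
    val put = new Put(Bytes.toBytes(key))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
  }

  // kafkaStream.map { case (k, v) => toPut(k, v) }.foreachRDD(_.saveAsNewAPIHadoopDataset(hbaseConf))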

HTTP 500 if try to access Spark UI in yarn-cluster (only)

2015-10-16 Thread Sebastian YEPES FERNANDEZ
Hello, I am wondering if anyone else is also facing this issue: https://issues.apache.org/jira/browse/SPARK-11147

Re: Convert SchemaRDD to RDD

2015-10-16 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTt9YBFr17u8j8=Scala+Limitation+Case+Class+definition+with+more+than+22+arguments On Fri, Oct 16, 2015 at 7:41 AM, satish chandra j wrote: > Hi All, > To convert SchemaRDD to RDD below snipped is working if SQL

Compiling spark 1.5.1 fails with scala.reflect.internal.Types$TypeError: bad symbolic reference.

2015-10-16 Thread Simon Hafner
Fresh clone of spark 1.5.1, java version "1.7.0_85" build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package [error] bad symbolic reference. A signature in WebUI.class refers to term eclipse [error] in package org which is not available. [error] It may be completely missing

Re: Convert SchemaRDD to RDD

2015-10-16 Thread satish chandra j
Hi Ted, I have implemented the below snippet but am getting the error "type mismatch found String required Serializable" as mentioned in the mail chain: class MyRecord(val val1: String, val val2: String, ... more than 22, in this case e.g. 26) extends Product with Serializable { def canEqual(that:
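A possible way around the 22-field limit altogether (a sketch, assuming the goal is simply a delimited-string RDD rather than a typed record class): work on each Row's value sequence instead of a tuple or a hand-written Product.

  // schemaRdd is the result of sqlContext.sql(...); each element is a Row of any width
  val asStrings = schemaRdd.map { row =>
    row.toSeq.map(v => if (v == null) "" else v.toString).mkString("|")
  }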

Re: s3a file system and spark deployment mode

2015-10-16 Thread Scott Reynolds
hmm I tried using --jars and that got passed to MasterArguments and that doesn't work :-( https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala Same with Worker:

Re: Issue of jar dependency in yarn-cluster mode

2015-10-16 Thread Rex Xiong
I finally resolved this issue by adding --conf spark.executor.extraClassPath=snakeyaml-1.10.jar 2015-10-16 22:57 GMT+08:00 Rex Xiong : > Hi folks, > > In my spark application, executor task depends on snakeyaml-1.10.jar > I build it with Maven and it works fine: >

Accessing HDFS HA from spark job (UnknownHostException error)

2015-10-16 Thread kyarovoy
I have Apache Mesos 0.22.1 cluster (3 masters & 5 slaves), running Cloudera HDFS (2.5.0-cdh5.3.1) in HA configuration and Spark 1.5.1 framework. When I try to spark-submit compiled HdfsTest.scala example app (from Spark 1.5.1 sources) - it fails with "java.lang.IllegalArgumentException:

Re: Dynamic partition pruning

2015-10-16 Thread Xiao Li
Hi, Younes, Maybe you can open a JIRA? Thanks, Xiao Li 2015-10-16 12:43 GMT-07:00 Younes Naguib : > Thanks, > > Do you have a Jira I can follow for this? > > > > y > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* October-16-15 2:18 PM

Multiple joins in Spark

2015-10-16 Thread Shyam Parimal Katti
Hello All, I have the following SQL query: select a.a_id, b.b_id, c.c_id from table_a a join table_b b on a.a_id = b.a_id join table_c c on b.b_id = c.b_id In Scala I have done this so far: table_a_rdd = sc.textFile(...) table_b_rdd = sc.textFile(...) table_c_rdd = sc.textFile(...)

Re: Problem of RDD in calculation

2015-10-16 Thread Xiao Li
Hi, Frank, After registering these DFs as temp tables (via the registerTempTable API), you can do it using SQL. I believe this should be much easier. Good luck, Xiao Li 2015-10-16 12:10 GMT-07:00 ChengBo : > Hi all, > > > > I am new in Spark, and I have a question in

driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-16 Thread Hurshal Patel
Hi all, I've been struggling with a particularly puzzling issue after upgrading to Spark 1.5.1 from Spark 1.4.1. When I use the MySQL JDBC connector and an exception (e.g. com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException) is thrown on the executor, I get a ClassNotFoundException on the

Location preferences in pyspark?

2015-10-16 Thread Philip Weaver
I believe what I want is the exact functionality provided by SparkContext.makeRDD in Scala. For each element in the RDD, I want specify a list of preferred hosts for processing that element. It looks like this method only exists in Scala, and as far as I can tell there is no similar functionality
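For reference, a minimal sketch of what the Scala API provides (hypothetical host names); as the post says, there appears to be no PySpark equivalent:

  // one element per entry, each with its list of preferred hosts
  val data: Seq[(String, Seq[String])] = Seq(
    ("record-1", Seq("host1")),
    ("record-2", Seq("host2", "host3"))
  )
  val rdd = sc.makeRDD(data)   // each partition is scheduled preferring the listed hosts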

How to have a single reference of a class in Spark Streaming?

2015-10-16 Thread swetha
Hi, how can I have a single reference to a class across all the executors in Spark Streaming? The contents of the class will be updated on all the executors. Would using it as a variable inside updateStateByKey guarantee that the reference is updated across all the executors and no

Re: How to put an object in cache for ever in Streaming

2015-10-16 Thread swetha kasireddy
What about cleaning up the temp data that gets generated by shuffles? We have a lot of temp data that gets generated by shuffles in the /tmp folder. That's why we are using ttl. Also, if I keep an RDD in cache, is it available across all the executors or just the same executor? On Fri, Oct 16, 2015 at

Re: Multiple joins in Spark

2015-10-16 Thread Xiao Li
Hi, Shyam, You still can use SQL to do the same thing in Spark: For example, val df1 = sqlContext.createDataFrame(rdd) val df2 = sqlContext.createDataFrame(rdd2) val df3 = sqlContext.createDataFrame(rdd3) df1.registerTempTable("tab1") df2.registerTempTable("tab2")
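A more complete sketch of the pattern Xiao describes, using the table and column names from the original question (assuming the three RDDs hold case-class records so createDataFrame can infer a schema):

  val df1 = sqlContext.createDataFrame(tableARdd)
  val df2 = sqlContext.createDataFrame(tableBRdd)
  val df3 = sqlContext.createDataFrame(tableCRdd)

  df1.registerTempTable("table_a")
  df2.registerTempTable("table_b")
  df3.registerTempTable("table_c")

  val joined = sqlContext.sql(
    """SELECT a.a_id, b.b_id, c.c_id
      |FROM table_a a
      |JOIN table_b b ON a.a_id = b.a_id
      |JOIN table_c c ON b.b_id = c.b_id""".stripMargin)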

Re: How to put an object in cache for ever in Streaming

2015-10-16 Thread Tathagata Das
Setting a ttl is not recommended any more, as Spark works with the Java GC to clean up stuff (RDDs, shuffles, broadcasts, etc.) that is no longer referenced. So you can keep an RDD cached in Spark, and every minute uncache the previous one and cache a new one. TD On Fri, Oct 16, 2015 at 12:02
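A minimal sketch of that rotation (loadSnapshot is a hypothetical loader for whatever data is being kept hot):

  var current = loadSnapshot(sqlContext)   // hypothetical: builds the RDD/DataFrame to keep cached
  current.cache()

  // on whatever refresh cadence fits (e.g. once per minute):
  val refreshed = loadSnapshot(sqlContext)
  refreshed.cache()
  current.unpersist()                      // release the stale copy; its blocks are dropped
  current = refreshed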

Re: Problem of RDD in calculation

2015-10-16 Thread Xiao Li
For most programmers, DataFrames are preferred thanks to their flexibility, but SQL syntax is a great option for users who feel more comfortable with SQL. : ) 2015-10-16 18:22 GMT-07:00 Ali Tajeldin EDU : > Since DF2 only has the userID, I'm assuming you are using

Re: Multiple joins in Spark

2015-10-16 Thread Xiao Li
Hi, Shyam, The method registerTempTable registers a DataFrame as a temporary table in the Catalog under the given table name. In the Catalog, Spark maintains a concurrent hashmap, which contains pairs of table names and logical plans. For example, when we submit the following

Re: Problem of RDD in calculation

2015-10-16 Thread Ali Tajeldin EDU
Since DF2 only has the userID, I'm assuming you are using DF2 to filter for the desired userIDs. You can just use the join() and groupBy() operations on DataFrame to do what you want. For example: scala> val df1=app.createDF("id:String; v:Integer", "X,1;X,2;Y,3;Y,4;Z,10") df1:
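In plain DataFrame API terms, a sketch using the column names from the original question (createDF above comes from a helper library, so standard calls are shown here instead):

  import org.apache.spark.sql.functions._

  // keep only rows of DF1 whose userID appears in DF2, then aggregate per user
  val filtered = df1.join(df2, "userID")
  val perUser  = filtered.groupBy("userID")
    .agg(sum("dataUsage").as("totalUsage"), sum("duration").as("totalDuration"))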

Re: Spark on Mesos / Executor Memory

2015-10-16 Thread Bharath Ravi Kumar
Can someone respond if you're aware of the reason for such a memory footprint? It seems unintuitive and hard to reason about. Thanks, Bharath On Thu, Oct 15, 2015 at 12:29 PM, Bharath Ravi Kumar wrote: > Resending since user@mesos bounced earlier. My apologies. > > On Thu,

Re: How to speed up reading from file?

2015-10-16 Thread Xiao Li
Hi, Saif, The optimal number of rows per partition depends on many factors, right? For example, your row size, your file system configuration, your replication configuration, and the performance of your underlying hardware. The best way is to do performance testing and tune your
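A rough sizing sketch along those lines (the ~128MB target and the input-size estimate are assumptions to be tuned, not a rule): aim for a manageable per-file size when writing, and the later read is automatically distributed across those files.

  // e.g. aim for roughly 128MB per output partition
  val estimatedInputBytes = 10L * 1024 * 1024 * 1024      // replace with your own estimate
  val targetSizeBytes     = 128L * 1024 * 1024
  val numPartitions       = math.max(1, (estimatedInputBytes / targetSizeBytes).toInt)

  df.repartition(numPartitions).write.parquet("hdfs:///path/out")

  // re-reading is distributed across the written files
  val reloaded = sqlContext.read.parquet("hdfs:///path/out")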