Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Brandon White
Hi Yin, Yes, there were no new rows. I fixed it by doing a .remember on the context. Obviously, this is not ideal. On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote: Hi Brandon, Can you explain what you meant by "It simply does not work"? You did not see new data files?
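
For anyone hitting the same issue, the workaround looks roughly like this (a minimal sketch; the duration is illustrative):

    import org.apache.spark.streaming.Minutes

    // Keep each batch's RDDs around longer than the default retention so that
    // later SQL inserts can still see the data.
    ssc.remember(Minutes(5))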

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Sorry all for not being clear. I'm using Spark 1.4; the table is a partitioned Hive table. On Sun, Jul 12, 2015 at 6:36 PM, Yin Huai yh...@databricks.com wrote: Jerrick, Let me ask a few clarification questions. What is the version of Spark? Is the table a Hive table?

Re: Ordering of Batches in Spark streaming

2015-07-12 Thread anshu shukla
Can anyone shed some light on how Spark does *ordering of batches*? On Sat, Jul 11, 2015 at 9:19 AM, anshu shukla anshushuk...@gmail.com wrote: Thanks Ayan, I was curious to know *how Spark does it*. Is there any *documentation* where I can get the details about that? Will
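
For what it's worth: by default the Spark Streaming scheduler runs the jobs of each batch sequentially and in batch order; the (undocumented) spark.streaming.concurrentJobs property controls how many batch jobs may run at once. A minimal sketch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("OrderedBatches")
      // Default is 1, so batches are processed strictly in order; raising this
      // lets jobs from different batches overlap.
      .set("spark.streaming.concurrentJobs", "1")
    val ssc = new StreamingContext(conf, Seconds(10)) // one batch every 10 seconds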

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread Sean Owen
Yeah, it won't technically be supported, and you shouldn't go modifying the actual installation, but if you just make your own build of 1.4 for CDH 5.4 and use that build to launch YARN-based apps, I imagine it will Just Work for most any use case. On Sun, Jul 12, 2015 at 7:34 PM, Ruslan

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread David Sabater Dinter
It seems this feature was added in Hive 0.13 (https://issues.apache.org/jira/browse/HIVE-4943). I would assume this is supported, as Spark is by default compiled against Hive 0.13.1. On Sun, Jul 12, 2015 at 7:42 PM, Ruslan Dautkhanov dautkha...@gmail.com wrote: You can see what Spark SQL functions

Re: Spark equivalent for Oracle's analytical functions

2015-07-12 Thread Ruslan Dautkhanov
Should be part of Spark 1.4 https://issues.apache.org/jira/browse/SPARK-1442 I don't see it in the documentation though https://spark.apache.org/docs/latest/sql-programming-guide.html -- Ruslan Dautkhanov On Mon, Jul 6, 2015 at 5:06 AM, gireeshp gireesh.puthum...@augmentiq.in wrote: Is
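
For reference, a minimal sketch of the Spark 1.4 window-function API (assuming a DataFrame df with columns dept and salary; in 1.4 this requires a HiveContext):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Rank rows within each department by salary, akin to Oracle's
    // RANK() OVER (PARTITION BY dept ORDER BY salary DESC)
    val w = Window.partitionBy("dept").orderBy(desc("salary"))
    val ranked = df.withColumn("rank", rank().over(w))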

Calculating moving average of dataset in Apache Spark and Scala

2015-07-12 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: Read the dataset from HDFS. A few sample lines look like this: deviceid,bytes,eventdate 15590657,246620,20150630 14066921,1907,20150621 14066921,1906,20150626 6522013,2349,20150626
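
One possible approach, sketched under the assumption that plain RDDs suffice (column names follow the sample header; the 3-point window is illustrative):

    // Parse the CSV lines (skipping the header) into (deviceid, (eventdate, bytes))
    val rows = sc.textFile("hdfs:///path/to/dataset")
      .filter(!_.startsWith("deviceid"))
      .map { line =>
        val Array(dev, bytes, date) = line.split(",")
        (dev, (date, bytes.toDouble))
      }

    // Per device: order by date (yyyyMMdd strings sort chronologically),
    // then take a simple 3-point moving average
    val movingAvg = rows.groupByKey().mapValues { readings =>
      val byDate = readings.toSeq.sortBy(_._1).map(_._2)
      byDate.sliding(3).map(w => w.sum / w.size).toVector
    }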

Re: Starting Spark-Application without explicit submission to cluster?

2015-07-12 Thread Akhil Das
Yes, that is correct. You can use this boilerplate to avoid spark-submit:

    // The configuration
    val sconf = new SparkConf()
      .setMaster("spark://spark-ak-master:7077")
      .setAppName("SigmoidApp")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
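
A minimal continuation of that sketch, creating the context so the application can be launched as a plain Scala program instead of through spark-submit:

    import org.apache.spark.SparkContext

    // With the SparkConf above, create the context directly
    val sc = new SparkContext(sconf)
    sc.parallelize(1 to 10).count() // trivial sanity check
    sc.stop()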

Re: Issues when combining Spark and a third party java library

2015-07-12 Thread Akhil Das
Did you try setting HADOOP_CONF_DIR? Thanks Best Regards On Sat, Jul 11, 2015 at 3:17 AM, maxdml maxdemou...@gmail.com wrote: Also, it's worth noting that I'm using the prebuilt version for Hadoop 2.4 and higher from the official website.

Re: Linear search between particular log4j log lines

2015-07-12 Thread Akhil Das
Can you not use sc.wholeTextFiles() with a custom parser or a regex to extract the TransactionIDs? Thanks Best Regards On Sat, Jul 11, 2015 at 8:18 AM, ssbiox sergey.korytni...@gmail.com wrote: Hello, I have a very specific question on how to do a search between particular lines of
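
A minimal sketch of that idea (the path and the ID pattern are illustrative assumptions):

    // wholeTextFiles yields (filename, entire file contents), so a search that
    // spans multiple log lines becomes an ordinary regex over one string.
    val idPattern = """TransactionID=(\w+)""".r
    val ids = sc.wholeTextFiles("hdfs:///logs/*.log").flatMap { case (_, content) =>
      idPattern.findAllMatchIn(content).map(_.group(1))
    }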

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread Akhil Das
Can you dig a bit more into the worker logs? Also make sure that Spark has permission to write to /opt/ on that machine, since it's the one machine that keeps failing. Thanks Best Regards On Sat, Jul 11, 2015 at 11:18 PM, gaurav sharma sharmagaura...@gmail.com wrote: Hi All, I am facing this issue in

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread David Sabater Dinter
As Sean suggested, you can actually build Spark 1.4 for CDH 5.4.x and also include the Hive libraries for 0.13.1, but *this will be completely unsupported by Cloudera*. I would suggest doing that only if you just want to experiment with new features from Spark 1.4, i.e. run SparkSQL with sort-merge

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Ted Yu
Which Spark release do you use ? Cheers On Sun, Jul 12, 2015 at 5:03 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do a 'describe table' from SparkSQL CLI it seems to try looking at all

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread ayan guha
Describe computes statistics, so it will try to query the table. The one you are looking for is df.printSchema() On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do
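
For example (a sketch, assuming the table is registered with the SQL context):

    // Prints only column names and types; no records are scanned.
    sqlContext.table("my_table").printSchema()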

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon, Can you explain what you meant by "It simply does not work"? You did not see new data files? Thanks, Yin On Fri, Jul 10, 2015 at 11:55 AM, Brandon White bwwintheho...@gmail.com wrote: Why does this not work? Is insert into broken in 1.3.1? It does not throw any errors, fail, or

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
Jerrick, Let me ask a few clarification questions. What is the version of Spark? Is the table a hive table? What is the format of the table? Is the table partitioned? Thanks, Yin On Sun, Jul 12, 2015 at 6:01 PM, ayan guha guha.a...@gmail.com wrote: Describe computes statistics, so it will

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread ayan guha
Spark already provides an explode function on lateral views. Please see https://issues.apache.org/jira/browse/SPARK-5573. On Mon, Jul 13, 2015 at 6:47 AM, David Sabater Dinter david.sabater.maill...@gmail.com wrote: It seems this feature was added in Hive 0.13.
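
For reference, a minimal sketch of explode via the DataFrame API (df and the array column items are illustrative names; posexplode itself, which also returns the element's position, needs the newer Hive support):

    import org.apache.spark.sql.functions.explode

    // Each element of the array column `items` becomes its own row
    val exploded = df.select(df("id"), explode(df("items")).as("item"))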

SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Hi all, I'm new to Spark and this question may be trivial or may already have been answered, but when I do a 'describe table' from the SparkSQL CLI it seems to look at all records in the table (which takes a really long time for a big table) instead of just giving me the metadata of the table. Would

Spark Standalone Mode not working in a cluster

2015-07-12 Thread Eduardo
My installation of Spark is not working correctly in my local cluster. I downloaded spark-1.4.0-bin-hadoop2.6.tgz and untarred it in a directory visible to all nodes (these nodes are all accessible by passwordless ssh). In addition, I edited conf/slaves so that it contains the names of the nodes.

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Jong Wook Kim
Based on my experience, YARN containers can get SIGTERM when - they produce too many logs and use up the hard drive - they use more off-heap memory than is allowed by the spark.yarn.executor.memoryOverhead configuration. It might be due to too many classes loaded (less than MaxPermGen but more
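
For reference, the overhead can be raised explicitly; a sketch (the value is illustrative; the default is a fraction of executor memory with a 384 MB floor):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Off-heap headroom YARN grants each executor container, in MB
      .set("spark.yarn.executor.memoryOverhead", "1024")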

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Ruslan Dautkhanov
"the executor receives a SIGTERM (from whom???)" From the YARN Resource Manager. Check if YARN fair-scheduler preemption and/or speculative execution are turned on; then it's quite possible and not a bug. -- Ruslan Dautkhanov On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim jongw...@nyu.edu

Few basic spark questions

2015-07-12 Thread Oded Maimon
Hi All, we are evaluating Spark for real-time analytics. What we are trying to do is the following: - READER APP - use a custom receiver to get data from RabbitMQ (written in Scala) - ANALYZER APP - use a Spark R application to read the data (windowed), analyze it every minute and save the

Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran
On 11 Jul 2015, at 19:20, Aaron Davidson ilike...@gmail.com wrote: Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS

Re: Spark performance

2015-07-12 Thread santoshv98
Ravi, Spark (or in that case Big Data solutions like Hive) is suited for large analytical loads, where “scaling up” starts to pale in comparison to “scaling out” with regards to performance, versatility (types of data) and cost. Without going into the details of MSSQL architecture, there

Re: createDirectStream and Stats

2015-07-12 Thread gaurav sharma
Hi guys, I too am facing a similar challenge with directstream. I have 3 Kafka partitions and am running Spark on 18 cores, with the parallelism level set to 48. I am running a simple map-reduce job on the incoming stream. Though the reduce stage takes milliseconds to seconds for around 15 million packets,

Including additional scala libraries in sparkR

2015-07-12 Thread Michal Haris
I have a Spark program with a custom optimised RDD for HBase scans and updates, and a small library of Scala objects to support efficient serialisation, partitioning etc. I would like to use R as an analysis and visualisation front-end. I have tried to use rJava (i.e. not using SparkR) and I

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread gaurav sharma
The logs I pasted are from the worker logs only. Spark does have permission to write into /opt; it's not like the worker is not able to start. It runs perfectly for days, but then abruptly dies, and it's not always this machine; sometimes it's some other machine. It happens once in a while, but

Re: Is it possible to change the default port number 7077 for spark?

2015-07-12 Thread maxdml
Q1: You can change the port number on the master by setting SPARK_MASTER_PORT in conf/spark-env.sh. I don't know what the impact would be on a Cloudera distro, though. Q2: Yes: a Spark worker needs to be present on each node which you want to make available to the driver. Q3: You can submit an application
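
A sketch of how an application would then connect (port 7177 is illustrative, assuming SPARK_MASTER_PORT=7177 was exported on the master):

    import org.apache.spark.SparkConf

    // Point the driver at the non-default master port
    val conf = new SparkConf().setMaster("spark://master-host:7177")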

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread Ruslan Dautkhanov
You can see what Spark SQL functions are supported in Spark by doing the following in a notebook: %sql show functions https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html I think Spark SQL support is currently around Hive ~0.11? -- Ruslan
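
Outside a notebook the same check is a one-liner (a sketch, assuming a HiveContext named sqlContext):

    // Lists the SQL/Hive functions registered with the context
    sqlContext.sql("SHOW FUNCTIONS").show(1000)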

How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread afarahat
Hello, I am using the ALS recommender in MLlib. To select the optimal rank, I have a number of users who used multiple items as my test set. I then get the predictions for these users and compare them to the observed values. I use RegressionMetrics to estimate the R^2, and I keep getting a negative value.

Re: Issues when combining Spark and a third party java library

2015-07-12 Thread Max Demoulin
Yes, Thank you. -- Henri Maxime Demoulin 2015-07-12 2:53 GMT-04:00 Akhil Das ak...@sigmoidanalytics.com: Did you try setting the HADOOP_CONF_DIR? Thanks Best Regards On Sat, Jul 11, 2015 at 3:17 AM, maxdml maxdemou...@gmail.com wrote: Also, it's worth noting that I'm using the prebuilt

Re: Caching in spark

2015-07-12 Thread Ruslan Dautkhanov
Hi Akhil, it would be interesting to know whether RDDs are stored internally in a columnar format as well, or whether it is only when an RDD is cached in the SQL context that it is converted to columnar format. What about DataFrames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das
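
For what it's worth, a plain RDD cache stores rows as opaque JVM objects (or serialized blobs), while caching through the SQL context uses the in-memory columnar format. A sketch of the two paths:

    // Columnar, compressed in-memory cache (SQL/DataFrame path)
    sqlContext.cacheTable("events")

    // Plain object cache (RDD path): rows kept as JVM objects, not columns
    someRdd.cache()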

Re: How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread Sean Owen
In general, a negative R2 means the line that was fit is a very poor fit -- the mean would give a smaller squared error. But it can also mean you are applying R2 where it doesn't apply. Here, you're not performing a linear regression; why are you using R2? On Sun, Jul 12, 2015 at 4:22 PM, afarahat
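
Concretely, RegressionMetrics computes R^2 as 1 - SS_res/SS_tot, which goes negative whenever the model's squared error exceeds that of simply predicting the mean. A tiny worked sketch:

    val obs  = Seq(1.0, 2.0, 3.0)
    val pred = Seq(3.0, 3.0, 3.0) // a deliberately poor model
    val mean = obs.sum / obs.size // 2.0
    val ssRes = obs.zip(pred).map { case (y, p) => math.pow(y - p, 2) }.sum // 5.0
    val ssTot = obs.map(y => math.pow(y - mean, 2)).sum                     // 2.0
    val r2 = 1 - ssRes / ssTot // -1.5: negative because ssRes > ssTot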

Re: Real-time data visualization with Zeppelin

2015-07-12 Thread andy petrella
Heya, You might be looking for something like this I guess: https://www.youtube.com/watch?v=kB4kRQRFAVc. The Spark-Notebook (https://github.com/andypetrella/spark-notebook/) can bring that to you actually, it uses fully reactive bilateral communication streams to update data and viz, plus it

Re: How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread Feynman Liang
This might be a bug... R^2 should always be in [0,1] and variance should never be negative. Can you give more details on which version of Spark you are running? On Sun, Jul 12, 2015 at 8:37 AM, Sean Owen so...@cloudera.com wrote: In general, R2 means the line that was fit is a very poor fit --