Querying JSON in Spark SQL

2015-03-16 Thread Fatma Ozcan
Is there any documentation that explains how to query JSON documents using SparkSQL? Thanks, Fatma

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
FWIW the JIRA I was thinking about is https://issues.apache.org/jira/browse/SPARK-3098 On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I vaguely remember that JIRA and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you

question regarding the dependency DAG in Spark

2015-03-16 Thread Grandl Robert
Hi guys, I am trying to get a better understanding of the DAG generation for a job in Spark. Ideally, what I want is to run some SQL query and extract the generated DAG by Spark. By DAG I mean the stages and dependencies among stages, and the number of tasks in every stage. Could you guys

Re: Using TF-IDF from MLlib

2015-03-16 Thread Joseph Bradley
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item which was asked about the reliability of zipping RDDs. Basically, it should be reliable, and if it is not, then it should be reported as a bug. This general approach should work (with explicit
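
A minimal sketch of the zip-based pairing this thread discusses, assuming documents keyed by a hypothetical ID field (names and input types are illustrative, not from the original message):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Pair each document ID with its TF-IDF vector. Both transform() calls are
    // map-like, so zip() relies on the per-partition order being preserved,
    // which is exactly the behaviour discussed in this thread.
    def tfidfWithIds(docs: RDD[(String, Seq[String])]): RDD[(String, Vector)] = {
      val tf = new HashingTF().transform(docs.map(_._2)).cache()
      val tfidf = new IDF().fit(tf).transform(tf)
      docs.map(_._1).zip(tfidf)
    }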

Re: order preservation with RDDs

2015-03-16 Thread kian.ho
For those still interested, I raised this issue on JIRA and received an official response: https://issues.apache.org/jira/browse/SPARK-6340 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052p22088.html Sent from the Apache

Re: Using TF-IDF from MLlib

2015-03-16 Thread Sean Owen
Dang I can't seem to find the JIRA now but I am sure we had a discussion with Matei about this and the conclusion was that RDD order is not guaranteed unless a sort is involved. On Mar 17, 2015 12:14 AM, Joseph Bradley jos...@databricks.com wrote: This was brought up again in

version conflict common-net

2015-03-16 Thread Jacob Abraham
Hi Folks, I have a situation where I am getting a version conflict between Java libraries that are used by my application and ones used by Spark. Following are the details - I use Spark provided by Cloudera running on the CDH5.3.2 cluster (Spark 1.2.0-cdh5.3.2). The library that is causing the

Spark from S3 very slow

2015-03-16 Thread Pere Kyle
I am seeing extremely slow performance from Spark 1.2.1 (MAPR4) on Hadoop 2.5.1 (YARN) on hive external tables on s3n. I am running a 'select count(*) from s3_table' query on the nodes using Hive 0.13 and Spark SQL 1.2.1. I am running a 5 node cluster on EC2 c3.2xlarge Mapr 4.0.2 M3 cluster. The

Garbage stats in Random Forest leaf node?

2015-03-16 Thread cjwang
I dumped the trees in the random forest model, and occasionally saw a leaf node with strange stats: - pred=1.00 prob=0.80 imp=-1.00

Suggestion for user logging

2015-03-16 Thread Xi Shen
Hi, When you submit a jar to the Spark cluster, it is very difficult to see the logging. Is there any way to save the logging to a file? I mean only the logging I created, not the Spark log information. Thanks, David

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
I vaguely remember that JIRA and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you only use operations like map which preserve partitioning, ordering should be guaranteed from what I know. On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote: Dang

Re: why generateJob is a private API?

2015-03-16 Thread Tathagata Das
It was not really meant to be public and overridden, because anything you want to do to generate jobs from RDDs can be done using DStream.foreachRDD. On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to create a simple subclass of DStream. If I

Re: [SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5Kk8Zs1 You can find Spark built against multiple hadoop releases in: http://people.apache.org/~pwendell/spark-1.3.0-rc3/ FYI On Mon, Mar 16, 2015 at 11:36 AM, Shuai Zheng szheng.c...@gmail.com wrote: And it is an NoSuchMethodError, not a

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
Hi Shuai, yup, that is exactly what I meant -- implement your own class MyGroupingRDD. This is definitely more detail than a lot of users will need to go, but its also not all that scary either. In this case, you want something that is *extremely* close to the existing CoalescedRDD, so start by

Re: ClassNotFoundException

2015-03-16 Thread Kevin (Sangwoo) Kim
Hi Ralph, It seems like the https://issues.apache.org/jira/browse/SPARK-6299 issue, which I'm working on. I submitted a PR for it; would you test it? Regards, Kevin On Tue, Mar 17, 2015 at 1:11 AM Ralph Bergmann ra...@dasralph.de wrote: Hi, I want to try the JavaSparkPi example[1] on a

Re: Spark 1.3 createDataframe error with pandas df

2015-03-16 Thread kevindahl
kevindahl wrote I'm trying to create a spark data frame from a pandas data frame, but for even the most trivial of datasets I get an error along the lines of this: --- Py4JJavaError Traceback

RE: How to set Spark executor memory?

2015-03-16 Thread jishnu.prathap
Hi Xi Shen, You could set spark.executor.memory in the code itself: new SparkConf().set("spark.executor.memory", "2g"). Or you can pass --executor-memory 2g while submitting the jar. Regards Jishnu Prathap From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, March 16,
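
For reference, a minimal sketch of both approaches (the app name, main class and jar are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Must be set before the SparkContext is created; changing it afterwards has no effect.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    // Or at submit time instead of in code:
    //   spark-submit --executor-memory 2g --class my.Main my-app.jar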

HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Bharath Ravi Kumar
Hi, Trying to run spark ( 1.2.1 built for hdp 2.2) against a yarn cluster results in the AM failing to start with following error on stderr: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher An application id was assigned to the job, but there were no logs.

Re: k-means hang without error/warning

2015-03-16 Thread Xi Shen
Hi Sean, My system is Windows 64-bit. I looked into the resource manager: Java is the only process that used about 13% CPU resources; no disk activity related to Java; only about 6GB memory used out of 56GB in total. My system responds very well. I don't think it is a system issue. Thanks, David

Re: Can I start multiple executors in local mode?

2015-03-16 Thread xu Peng
Hi David, You can try the local-cluster master. The numbers in local-cluster[2,2,1024] mean 2 workers, 2 cores per worker, and 1024 MB of memory per worker. Best Regards Peng Xu 2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com: Hi, In YARN mode you can specify the number of executors. I wonder if we can
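
A short sketch of what that looks like (local-cluster is mainly used by Spark's own tests, so treat it as a testing aid; the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
    // spawns separate executor JVMs on the local machine.
    val conf = new SparkConf()
      .setMaster("local-cluster[2,2,1024]")
      .setAppName("local-multi-executor-test")
    val sc = new SparkContext(conf)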

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Ralph Bergmann
I can access the management web page at port 8080 from my Mac, and it told me that the master and 1 slave are running and I can access them at port 7077. But the port scanner shows that port 8080 is open but not port 7077. I started the port scanner on the same machine where Spark is running. Ralph Am

insert hive partitioned table

2015-03-16 Thread patcharee
Hi, I tried to insert into a hive partitioned table val ZONE: Int = Integer.valueOf(args(2)) val MONTH: Int = Integer.valueOf(args(3)) val YEAR: Int = Integer.valueOf(args(4)) val weightedUVToDF = weightedUVToRecord.toDF() weightedUVToDF.registerTempTable("speeddata") hiveContext.sql("INSERT

Re: Iterative Algorithms with Spark Streaming

2015-03-16 Thread Nick Pentreath
MLlib supports streaming linear models: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression and k-means: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means With an iteration parameter of 1, this amounts to mini-batch SGD where the mini-batch is
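
A minimal sketch of the streaming linear regression API mentioned above (paths, batch interval and feature count are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("streaming-lr"), Seconds(10))
    val numFeatures = 3

    val training = ssc.textFileStream("hdfs:///train").map(LabeledPoint.parse)
    val test     = ssc.textFileStream("hdfs:///test").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
      .setNumIterations(1)   // one iteration per batch, i.e. the mini-batch SGD noted above

    model.trainOn(training)
    model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()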

Re: insert hive partitioned table

2015-03-16 Thread Cheng Lian
Not quite sure whether I understand your question properly. But if you just want to read the partition columns, it’s pretty easy. Take the “year” column as an example, you may do this in HiveQL: hiveContext.sql("SELECT year FROM speed") or in DataFrame DSL:
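
Spelled out a bit more, assuming a HiveContext and the "speed" table from this thread (Spark 1.3 DataFrame API assumed for the DSL form):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT year FROM speed").show()    // HiveQL
    hiveContext.table("speed").select("year").show()    // DataFrame DSL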

Re: insert hive partitioned table

2015-03-16 Thread patcharee
I would like to insert into the table, and the value of the partition column to be inserted must come from a temporarily registered table/dataframe. Patcharee On 16. mars 2015 15:26, Cheng Lian wrote: Not quite sure whether I understand your question properly. But if you just want to read the

Re: MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi, Thanks for the response. I understand that part. But I am asking why the internal implementation uses a subclass when it can use an existing API? Unless there is a real difference, it feels like a code smell to me. Regards, Madhukara Phatak http://datamantra.io/ On Mon, Mar 16, 2015 at 2:14

Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoop file for this. Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is compute splits on gz files, so if you have a single file, you'll have a single partition.
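
In practice that usually means repartitioning right after reading; a rough sketch (path and multiplier are placeholders):

    // One gzipped file = one partition after textFile(); repartition to spread the work.
    val lines = sc.textFile("hdfs:///data/file.gz")
      .repartition(sc.defaultParallelism * 3)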

Re: Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank

2015-03-16 Thread Arush Kharbanda
You can track the issue here: https://issues.apache.org/jira/browse/SPARK-1442 It's currently not supported; I guess the test cases are a work in progress. On Mon, Mar 16, 2015 at 12:44 PM, hseagle hsxup...@gmail.com wrote: Hi all, I'm wondering whether the latest spark-1.3.0 supports

Parquet and repartition

2015-03-16 Thread Masf
Hi all. When I specify the number of partitions and save this RDD in Parquet format, my app fails. For example: selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput"). However, it works well if I store the data as text: selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput"). My

Error when using multiple python files spark-submit

2015-03-16 Thread poiuytrez
I have a spark app which is composed of multiple files. When I launch Spark using: ../hadoop/spark-install/bin/spark-submit main.py --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077 I am getting an error:

Can I start multiple executors in local mode?

2015-03-16 Thread Xi Shen
Hi, In YARN mode you can specify the number of executors. I wonder if we can also start multiple executors at local, just to make the test run faster. Thanks, David

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Shixiong Zhu
There are 2 cases for "No space left on device": 1. Some tasks which use large temp space cannot run on any node. 2. The free space of datanodes is not balanced. Some tasks which use large temp space cannot run on several nodes, but they can run on other nodes successfully. Because most of our

Re: jar conflict with Spark default packaging

2015-03-16 Thread Shawn Zheng
Thanks a lot. I will give a try! On Monday, March 16, 2015, Adam Lewandowski adam.lewandow...@gmail.com wrote: Prior to 1.3.0, Spark has 'spark.files.userClassPathFirst' for non-yarn apps. For 1.3.0, use 'spark.executor.userClassPathFirst'. See

Re: How to set Spark executor memory?

2015-03-16 Thread Sean Owen
There are a number of small misunderstandings here. In the first instance, the executor memory is not actually being set to 2g and the default of 512m is being used. If you are writing code to launch an app, then you are trying to duplicate what spark-submit does, and you don't use spark-submit.

Re: configure number of cached partition in memory on SparkSQL

2015-03-16 Thread Cheng Lian
Hi Judy, In the case of HadoopRDD and NewHadoopRDD, the partition number is actually decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize is not related to the partition number; it controls the in-memory columnar batch size within a single partition. Also, what

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, We're facing No space left on device errors lately from time to time. The job will fail after retries. Obvious in such case, retry won't be

Re: Parquet and repartition

2015-03-16 Thread Masf
Thanks Sean, I forgot it. The output error is the following: java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359) at

Re: k-means hang without error/warning

2015-03-16 Thread Sean Owen
I think you'd have to say more about "stopped working". Is the GC thrashing? Does the UI respond? Is the CPU busy or not? On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I am running k-means using Spark in local mode. My data set is about 30k records, and I set the k =

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Sean Owen
Are you sure the master / slaves started? Do you have network connectivity between the two? Do you have multiple interfaces maybe? Does debian resolve correctly and as you expect to the right host/interface? On Mon, Mar 16, 2015 at 8:14 AM, Ralph Bergmann ra...@dasralph.de wrote: Hi, I try my

Iterative Algorithms with Spark Streaming

2015-03-16 Thread Alex Minnaar
I wanted to ask a basic question about the types of algorithms that are possible to apply to a DStream with Spark streaming. With Spark it is possible to perform iterative computations on RDDs like in the gradient descent example val points = spark.textFile(...).map(parsePoint).cache()
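
For context, a self-contained sketch of that iterative pattern (simplified logistic-regression gradient descent; parsePoint, the input path and the dimensionality are placeholders, not code from the original message):

    import scala.math.exp
    import org.apache.spark.SparkContext

    case class Point(x: Array[Double], y: Double)

    // Hypothetical parser: "label f1 f2 ... fn" per line.
    def parsePoint(line: String): Point = {
      val parts = line.split(' ').map(_.toDouble)
      Point(parts.tail, parts.head)
    }

    def train(sc: SparkContext, path: String, dims: Int, iterations: Int): Array[Double] = {
      val points = sc.textFile(path).map(parsePoint).cache()   // cached: reused every iteration
      var w = Array.fill(dims)(0.0)
      for (_ <- 1 to iterations) {
        val gradient = points.map { p =>
          val margin = (w, p.x).zipped.map(_ * _).sum
          val scale = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
          p.x.map(_ * scale)
        }.reduce((a, b) => (a, b).zipped.map(_ + _))
        w = (w, gradient).zipped.map(_ - _)
      }
      w
    }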

Re: How to preserve/preset partition information when load time series data?

2015-03-16 Thread Imran Rashid
Hi Shuai, It should certainly be possible to do it that way, but I would recommend against it. If you look at HadoopRDD, it's doing all sorts of little book-keeping that you would most likely want to mimic, e.g., tracking the number of bytes and records that are read, setting up all the hadoop

RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread jaykatukuri
Hi all, I am trying to use the new ALS implementation under org.apache.spark.ml.recommendation.ALS. The new method to invoke for training seems to be override def fit(dataset: DataFrame, paramMap: ParamMap): ALSModel. How do I create a dataframe object from ratings data set that is on hdfs ?
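
A rough sketch of one way to do it with Spark 1.3 (the path, delimiter and column names are placeholders):

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SQLContext

    case class Rating(user: Int, item: Int, rating: Float)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse the ratings file on HDFS into a case-class RDD, then convert it to a DataFrame.
    val ratings = sc.textFile("hdfs:///ratings.csv")
      .map(_.split(','))
      .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toFloat))
      .toDF()

    val model = new ALS()
      .setUserCol("user").setItemCol("item").setRatingCol("rating")
      .fit(ratings)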

Re: Problem connecting to HBase

2015-03-16 Thread HARIPRIYA AYYALASOMAYAJULA
Hello Ted, Yes, I can understand what you are suggesting. But I am unable to decipher where I am going wrong; could you please point out which locations I should look at to find and correct the mistake? I greatly appreciate your help! On Sun, Mar 15, 2015 at 1:10 PM, Ted Yu

ClassNotFoundException

2015-03-16 Thread Ralph Bergmann
Hi, I want to try the JavaSparkPi example[1] on a remote Spark server but I get a ClassNotFoundException. When I run it local it works but not remote. I added the spark-core lib as dependency. Do I need more? Any ideas? Thanks Ralph [1] ...

Re: Parquet and repartition

2015-03-16 Thread Cheng Lian
Hey Masf, I’ve created SPARK-6360 https://issues.apache.org/jira/browse/SPARK-6360 to track this issue. Detailed analysis is provided there. The TL;DR is, for Spark 1.1 and 1.2, if a SchemaRDD contains decimal or UDT column(s), after applying any traditional RDD transformations (e.g.

[SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Shuai Zheng
Hi All, I am running Spark 1.2.1 and the AWS SDK. To make sure the AWS SDK is compatible with httpclient 4.2 (which I assume Spark uses?), I have already downgraded to version 1.9.0. But even then, I still got an error: Exception in thread main java.lang.NoSuchMethodError:

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Ralph Bergmann
Okay, I think I found the mistake. The Eclipse Maven plugin suggested version 1.2.1 of the spark-core lib, but I use Spark 1.3.0. After fixing that, I can access the Spark server. Ralph On 16.03.15 at 14:39, Ralph Bergmann wrote: I can access the management web page at port 8080 from my Mac and it told

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
Hi Shuai, On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng szheng.c...@gmail.com wrote: Sorry I response late. Zhan Zhang's solution is very interesting and I look at into it, but it is not what I want. Basically I want to run the job sequentially and also gain parallelism. So if possible, if

Priority queue in spark

2015-03-16 Thread abhi
Hi, Currently all the jobs in Spark get submitted using a queue. I have a requirement where a submitted job will generate another set of jobs with some priority, which should again be submitted to the Spark cluster based on priority. That means a job with higher priority should be executed first. Is it

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Bharath Ravi Kumar
Hi Todd, Thanks for the help. I'll try again after building a distribution with the 1.3 sources. However, I wanted to confirm what I mentioned earlier: is it sufficient to copy the distribution only to the client host from where spark-submit is invoked(with spark.yarn.jar set), or is there a

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check spark.scheduler.executorTaskBlacklistTime. Jianshi On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu zsxw...@gmail.com wrote: There are 2 cases for No space left on device: 1. Some tasks

Re: insert hive partitioned table

2015-03-16 Thread Cheng Lian
I see. Since all Spark SQL queries must be issued from the driver side, you'll have to first collect all interested values to the driver side, and then use them to compose one or more insert statements. Cheng On 3/16/15 10:33 PM, patcharee wrote: I would like to insert the table, and the
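
A sketch of that pattern, reusing the table and partition columns from the earlier snippet in this thread (the selected value columns u and v are hypothetical):

    // Collect the distinct partition values to the driver, then issue one INSERT per partition.
    val parts = hiveContext.sql("SELECT DISTINCT zone, month, year FROM speeddata").collect()

    parts.foreach { row =>
      val (zone, month, year) = (row.getInt(0), row.getInt(1), row.getInt(2))
      hiveContext.sql(
        s"INSERT INTO TABLE speed PARTITION (zone=$zone, month=$month, year=$year) " +
        s"SELECT u, v FROM speeddata WHERE zone=$zone AND month=$month AND year=$year")
    }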

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Todd Nist
Hi Bharath, I ran into the same issue a few days ago; here is a link to a post on Hortonworks' forum: http://hortonworks.com/community/forums/search/spark+1.2.1/ In case anyone else needs to do this, these are the steps I took to get it to work with Spark 1.2.1 as well as Spark 1.3.0-RC3: 1.

Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows: lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3) For more details on why, see this answer http://stackoverflow.com/a/27631722/877069. Nick ​ On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote: 1. I

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check

Re: Scaling problem in RandomForest?

2015-03-16 Thread Xiangrui Meng
Try increasing the driver memory. We store trees on the driver node. If maxDepth=20 and numTrees=50, you may need a large driver memory to store all tree models. You might want to start with a smaller maxDepth and then increase it and see whether deep trees really help (vs. the cost). -Xiangrui

Re: Top rows per group

2015-03-16 Thread Xiangrui Meng
https://issues.apache.org/jira/browse/SPARK-5954 is for this issue and Shuo is working on it. We will first implement topByKey for RDDs and then we could add it to DataFrames. -Xiangrui On Mon, Mar 9, 2015 at 9:43 PM, Moss rhoud...@gmail.com wrote: I do have a schemaRDD where I want to group by
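
Until that lands, a per-key top-N can be sketched with aggregateByKey, keeping only a small buffer per key rather than materializing whole groups (the function name and descending ordering are illustrative):

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.x
    import org.apache.spark.rdd.RDD

    // Keep the n largest values per key; sorts a buffer of at most n + 1 elements at a time.
    def topByKey[K: ClassTag](rdd: RDD[(K, Double)], n: Int): RDD[(K, Seq[Double])] =
      rdd.aggregateByKey(Seq.empty[Double])(
        (buf, v) => (buf :+ v).sortBy(-_).take(n),
        (b1, b2) => (b1 ++ b2).sortBy(-_).take(n))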

Re: MappedStream vs Transform API

2015-03-16 Thread Tathagata Das
It's mostly for legacy reasons. First we had added all the MappedDStream, etc. and then later we realized we need to expose something that is more generic for arbitrary RDD-RDD transformations. It can be easily replaced. However, there is a slight value in having MappedDStream, for developers to

Any IRC channel on Spark?

2015-03-16 Thread Feng Lin
Hi everyone, I'm wondering whether there is a possibility to set up an official IRC channel on freenode. I noticed that a lot of Apache projects have such a channel to let people talk directly. Best Michael

Basic GraphX deployment and usage question

2015-03-16 Thread Khaled Ammar
Hi, I'm very new to Spark and GraphX. I downloaded and configured Spark on a cluster, which uses Hadoop 1.x. The master UI shows all workers. The example command run-example SparkPi works fine and completes successfully. I'm interested in GraphX. Although the documentation says it is built-in

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-16 Thread Eason Hu
Hi Akhil, Yes, I did change both versions on the project and the cluster. Any clues? Even the sample code from Spark website failed to work. Thanks, Eason On Sun, Mar 15, 2015 at 11:56 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you change both the versions? The one in your build

Re: [SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Ted Yu
From my local maven repo: $ jar tvf ~/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar | grep SchemeRegistry 1373 Fri Apr 19 18:19:36 PDT 2013 org/apache/http/impl/conn/SchemeRegistryFactory.class 2954 Fri Apr 19 18:19:36 PDT 2013

Creating a hive table on top of a parquet file written out by spark

2015-03-16 Thread kpeng1
Hi All, I wrote out a complex parquet file from spark sql and now I am trying to put a hive table on top. I am running into issues with creating the hive table itself. Here is the json that I wrote out to parquet using spark sql:

Re: Spark Streaming with compressed xml files

2015-03-16 Thread Vijay Innamuri
textFileStream and the default fileStream recognize the compressed xml (.xml.gz) files. Each line in the xml file is an element in RDD[String]. Then the whole RDD is converted to proper xml format data and stored in a Scala variable. - I believe storing huge data in a Scala variable is

Re: Priority queue in spark

2015-03-16 Thread twinkle sachdeva
Hi, Maybe this is what you are looking for : http://spark.apache.org/docs/1.2.0/job-scheduling.html#fair-scheduler-pools Thanks, On Mon, Mar 16, 2015 at 8:15 PM, abhi abhishek...@gmail.com wrote: Hi Current all the jobs in spark gets submitted using queue . i have a requirement where

Re: Iterate over contents of schemaRDD loaded from parquet file to extract timestamp

2015-03-16 Thread Cheng Lian
I don't see non-serializable objects in the provided snippets. But you can always add -Dsun.io.serialization.extendedDebugInfo=true to Java options to debug serialization errors. Cheng On 3/17/15 12:43 PM, anu wrote: Spark Version - 1.1.0 Scala - 2.10.4 I have loaded following type data

Re: Priority queue in spark

2015-03-16 Thread abhi
If I understand correctly, the above document creates pools for priority which are static in nature and have to be defined before submitting the job. In my scenario each generated task can have a different priority. Thanks, Abhi On Mon, Mar 16, 2015 at 9:48 PM, twinkle sachdeva

Re: Priority queue in spark

2015-03-16 Thread Mark Hamstra
http://apache-spark-developers-list.1001551.n3.nabble.com/Job-priority-td10076.html#a10079 On Mon, Mar 16, 2015 at 10:26 PM, abhi abhishek...@gmail.com wrote: If i understand correctly , the above document creates pool for priority which is static in nature and has to be defined before

Re: Priority queue in spark

2015-03-16 Thread abhi
Yes. Each generated job can have a different priority. It is like a recursive function, where in each iteration the generated job will be submitted to the Spark cluster based on its priority. Jobs with lower priority, or below some threshold, will be discarded. Thanks, Abhi On Mon, Mar 16, 2015

Iterate over contents of schemaRDD loaded from parquet file to extract timestamp

2015-03-16 Thread anu
Spark Version - 1.1.0 Scala - 2.10.4 I have loaded the following type of data from a parquet file, stored in a SchemaRDD: [7654321,2015-01-01 00:00:00.007,0.49,THU] Since, in Spark version 1.1.0, the parquet format doesn't support saving timestamp values, I have saved the timestamp data as a string. Can you
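
A rough sketch of converting the stored string back while iterating over the rows (assuming a SQLContext, and that the columns are long, string, double, string in that order, matching the sample record above; the path is a placeholder):

    import java.sql.Timestamp

    val data = sqlContext.parquetFile("hdfs:///data.parquet")
    val withTs = data.map { row =>
      (row.getLong(0), Timestamp.valueOf(row.getString(1)), row.getDouble(2), row.getString(3))
    }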

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-16 Thread Bharath Ravi Kumar
Still no luck running purpose-built 1.3 against HDP 2.2 after following all the instructions. Anyone else faced this issue? On Mon, Mar 16, 2015 at 8:53 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Hi Todd, Thanks for the help. I'll try again after building a distribution with the 1.3

Re: Saving Dstream into a single file

2015-03-16 Thread Zhan Zhang
Each RDD has multiple partitions, each of them will produce one hdfs file when saving output. I don’t think you are allowed to have multiple file handler writing to the same hdfs file. You still can load multiple files into hive tables, right? Thanks.. Zhan Zhang On Mar 15, 2015, at 7:31

Can LBFGS be used on streaming data?

2015-03-16 Thread EcoMotto Inc.
Hello, I am new to the Spark Streaming API. I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on streaming data? Currently I am using foreachRDD for parsing through the DStream and I am generating a model based on each RDD. Am I doing anything logically wrong here? Thank you. Sample
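
For reference, a sketch of calling LBFGS inside foreachRDD and carrying the fitted weights forward as the next batch's starting point (feature count and hyperparameters are placeholders; this is not the poster's original code):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.streaming.dstream.DStream

    val numFeatures = 10
    var weights = Vectors.zeros(numFeatures)

    def fitOn(stream: DStream[LabeledPoint]): Unit =
      stream.foreachRDD { rdd =>
        if (rdd.take(1).nonEmpty) {
          val data = rdd.map(lp => (lp.label, lp.features))
          val (w, _) = LBFGS.runLBFGS(
            data, new LeastSquaresGradient(), new SimpleUpdater(),
            numCorrections = 10, convergenceTol = 1e-4, maxNumIterations = 50,
            regParam = 0.0, initialWeights = weights)
          weights = w   // warm-start the next batch from the current solution
        }
      }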

Re: Querying JSON in Spark SQL

2015-03-16 Thread Matei Zaharia
The programming guide has a short example: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. Note that once you infer a schema for a JSON dataset, you can also use nested path notation
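
A minimal sketch following the guide (the file path, table name and fields are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val people = sqlContext.jsonFile("hdfs:///people.json")   // schema is inferred from the data
    people.printSchema()
    people.registerTempTable("people")

    // Nested fields can be addressed with path notation once the schema is known.
    sqlContext.sql("SELECT name, address.city FROM people WHERE age > 21").collect()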

RE: Process time series RDD after sortByKey

2015-03-16 Thread Shuai Zheng
Hi Imran, I am a bit confused here. Assume I have RDD a with 1000 partitions which has also been sorted. How can I control, when creating RDD b (with 20 partitions), that partitions 1-50 of RDD a map to the 1st partition of RDD b? I don't see any control code/logic here. Your code below:

Spark @ EC2: Futures timed out Ask timed out

2015-03-16 Thread Otis Gospodnetic
Hi, I've been trying to run a simple SparkWordCount app on EC2, but it looks like my apps are not succeeding/completing. I'm suspecting some sort of communication issue. I used the SparkWordCount app from http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/

Re: Streaming linear regression example question

2015-03-16 Thread Margus Roo
Tnx for the workaround. Margus (margusja) Roo http://margus.roo.ee skype: margusja +372 51 480 On 16/03/15 06:20, Jeremy Freeman wrote: Hi Margus, thanks for reporting this, I’ve been able to reproduce and there does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, can

Re: Spark Streaming with compressed xml files

2015-03-16 Thread Akhil Das
One approach would be, If you are using fileStream you can access the individual filenames from the partitions and with that filename you can apply your uncompression logic/parsing logic and get it done. Like: UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i];

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-16 Thread Akhil Das
Did you change both the versions? The one in your build file of your project and the spark version of your cluster? Thanks Best Regards On Sat, Mar 14, 2015 at 6:47 AM, EH eas...@gmail.com wrote: Hi all, I've been using Spark 1.1.0 for a while, and now would like to upgrade to Spark 1.1.1

Re: org.apache.spark.SparkException Error sending message

2015-03-16 Thread Akhil Das
Not sure if this will help, but can you try setting the following: set("spark.core.connection.ack.wait.timeout", "6000") Thanks Best Regards On Sat, Mar 14, 2015 at 4:08 AM, Chen Song chen.song...@gmail.com wrote: When I ran Spark SQL query (a simple group by query) via hive support, I have seen

Re: how to print RDD by key into file with grouByKey

2015-03-16 Thread Akhil Das
If you want more partitions then you have to specify it as: rdd.groupByKey(10).mapValues... I think if you don't specify anything, the number of partitions will be the number of cores that you have for processing. Thanks Best Regards On Sat, Mar 14, 2015 at 12:28 AM, Adrian Mocanu amoc...@verticalscope.com

Re: Running Scala Word Count Using Maven

2015-03-16 Thread Su She
Hello, So I actually solved the problem; see point 3. Here are a few approaches/errors I was getting: 1) mvn package exec:java -Dexec.mainClass=HelloWorld Error: java.lang.ClassNotFoundException: HelloWorld 2)

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread sandeep vura
In which location exactly do I need to specify the classpath? Thanks, On Mon, Mar 16, 2015 at 12:52 PM, Cheng, Hao hao.ch...@intel.com wrote: It doesn’t take effect if you just put the jar files under the lib-managed/jars folder; you need to put them on the class path explicitly. From:

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Dibyendu, Thanks for the reply. I am reading your project homepage now. One quick question I care about is: If the receivers failed for some reasons(for example, killed brutally by someone else), is there any mechanism for the receiver to fail over automatically? On Mon, Mar 16, 2015 at 3:25

Spark Streaming with compressed xml files

2015-03-16 Thread Vijay Innamuri
Hi All, Processing streaming JSON files with Spark features (Spark streaming and Spark SQL), is very efficient and works like a charm. Below is the code snippet to process JSON files. windowDStream.foreachRDD(IncomingFiles = { val IncomingFilesTable =

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread sandeep vura
Hi Fightfate, I have attached my hive-site.xml file in the previous mail.Please check the configuration once. In hive i am able to create tables and also able to load data into hive table. Please find the attached file. Regards, Sandeep.v On Mon, Mar 16, 2015 at 11:34 AM, fightf...@163.com

why generateJob is a private API?

2015-03-16 Thread madhu phatak
Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override compute for lazy operations and generateJob for actions. But when I try to override generateJob, it gives an error saying the method is private to the streaming package. Is my approach correct, or am

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
Or you need to specify the jars either in configuration or bin/spark-sql --jars mysql-connector-xx.jar From: fightf...@163.com [mailto:fightf...@163.com] Sent: Monday, March 16, 2015 2:04 PM To: sandeep vura; Ted Yu Cc: user Subject: Re: Re: Unable to instantiate

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread sandeep vura
I have already added mysql-connector-xx.jar file in spark/lib-managed/jars directory. Regards, Sandeep.v On Mon, Mar 16, 2015 at 11:48 AM, Cheng, Hao hao.ch...@intel.com wrote: Or you need to specify the jars either in configuration or bin/spark-sql --jars mysql-connector-xx.jar

Re: Need Advice about reading lots of text files

2015-03-16 Thread madhu phatak
Hi, Internally Spark uses HDFS api to handle file data. Have a look at HAR, Sequence file input format. More information on this cloudera blog http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Regards, Madhukara Phatak http://datamantra.io/ On Sun, Mar 15, 2015 at 9:59 PM, Pat

Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-16 Thread Akhil Das
If you use fileStream, there's an option to filter out files. In your case you can easily create a filter to remove _temporary files. In that case, you will have to move your code inside foreachRDD of the dstream since the application will become a streaming app. Thanks Best Regards On Sat, Mar
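
A sketch of that filter approach (paths, batch interval and the exact predicate are placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))

    // Skip _temporary (and other hidden) files while they are still being written.
    def accept(path: Path): Boolean = {
      val name = path.getName
      !name.startsWith("_") && !name.startsWith(".")
    }

    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("s3n://bucket/output", accept _, newFilesOnly = true)
      .map(_._2.toString)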

Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Guys, We have a project which builds upon Spark streaming. We use Kafka as the input stream, and create 5 receivers. When this application runs for around 90 hour, all the 5 receivers failed for some unknown reasons. In my understanding, it is not guaranteed that Spark streaming receiver will

Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank

2015-03-16 Thread hseagle
Hi all, I'm wondering whether the latest spark-1.3.0 supports the windowing and analytic functions in Hive, such as row_number, rank, etc. Indeed, I've done some testing using spark-shell and found that row_number is not supported yet. But I still found that there were

How to set Spark executor memory?

2015-03-16 Thread Xi Shen
Hi, I have set spark.executor.memory to 2048m, and in the UI Environment page I can see this value has been set correctly. But on the Executors page, I saw there's only 1 executor and its memory is 265.4MB. Very strange value. Why not 256MB, or just what I set? What am I missing here?

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
How are you setting it? and how are you submitting the job? Thanks Best Regards On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have set spark.executor.memory to 2048m, and in the UI Environment page, I can see this value has been set correctly. But in the

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Jun Yang
Akhil, I have checked the logs. There isn't any clue as to why the 5 receivers failed. That's why I just take it for granted that it will be a common issue for receiver failures, and we need to figure out a way to detect this kind of failure and do fail-over. Thanks On Mon, Mar 16, 2015 at

Re: k-means hang without error/warning

2015-03-16 Thread Akhil Das
How many threads are you allocating while creating the sparkContext? like local[4] will allocate 4 threads. You can try increasing it to a higher number also try setting level of parallelism to a higher number. Thanks Best Regards On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen davidshe...@gmail.com

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Akhil Das
You need to figure out why the receivers failed in the first place. Look in your worker logs and see what really happened. When you run a streaming job continuously for longer period mostly there'll be a lot of logs (you can enable log rotation etc.) and if you are doing a groupBy, join, etc type

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
It doesn’t take effect if you just put the jar files under the lib-managed/jars folder; you need to put them on the class path explicitly. From: sandeep vura [mailto:sandeepv...@gmail.com] Sent: Monday, March 16, 2015 2:21 PM To: Cheng, Hao Cc: fightf...@163.com; Ted Yu; user Subject: Re: Re: Unable

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Dibyendu Bhattacharya
Which version of Spark you are running ? You can try this Low Level Consumer : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer This is designed to recover from various failures and have very good fault recovery mechanism built in. This is being used by many users and at present
