Re: Feature Generation On Spark

2015-07-04 Thread ayan guha
Do you have one document per file or multiple documents per file? On 4 Jul 2015 23:38, Michal Čizmazia mici...@gmail.com wrote: SparkContext has a method wholeTextFiles. Is that what you need? On 4 July 2015 at 07:04, rishikesh rishikeshtha...@hotmail.com wrote: Hi I am new to Spark
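For reference, a minimal sketch of the suggestion (the HDFS path is hypothetical): wholeTextFiles returns one (path, content) record per file, whereas textFile returns one record per line.

    // each element is (fileName, entireFileContents), so one file maps to
    // exactly one record -- and hence one feature vector downstream
    val docs = sc.wholeTextFiles("hdfs:///data/docs")
    docs.take(1).foreach { case (path, text) =>
      println(s"$path -> ${text.length} chars")
    }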

Get Spark version before starting context

2015-07-04 Thread Patrick Woody
Hey all, Is it possible to reliably get the version string of a Spark cluster prior to trying to connect via the SparkContext on the client side? Most of the errors I've seen on mismatched versions have been cryptic, so it would be helpful if I could throw an exception earlier. I know it is

Re: Get Spark version before starting context

2015-07-04 Thread Patrick Woody
To somewhat answer my own question - it looks like an empty request to the REST API will throw an error which returns the version in JSON as well. Still not ideal, though. Would there be any objection to adding a simple version endpoint to the API? On Sat, Jul 4, 2015 at 4:00 PM, Patrick Woody
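A hedged sketch of that workaround (untested; the master host is hypothetical, and the serverSparkVersion field name is taken from the standalone REST server's error responses as described in this thread):

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    // a bad/empty request to the standalone REST server (default port 6066)
    // comes back as error JSON that includes the server's Spark version
    val conn = new URL("http://spark-master:6066/v1/submissions/status/x")
      .openConnection().asInstanceOf[HttpURLConnection]
    val stream =
      if (conn.getResponseCode >= 400) conn.getErrorStream else conn.getInputStream
    val body = Source.fromInputStream(stream).mkString
    val version = """serverSparkVersion"\s*:\s*"([^"]+)""".r
      .findFirstMatchIn(body).map(_.group(1))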

Authorisation issue in Spark while using SQL based Authorization

2015-07-04 Thread PKUKILLA
Though I have set hive.security.authorization.enabled=true and hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory, a user X can select a table belonging to user Y, as for some reason the Spark SQL Thrift server is not doing

calling HiveContext.table or running a query reads files unnecessarily in S3

2015-07-04 Thread Steve Lindemann
Hi, I'm just getting started with Spark, so apologies if I'm missing something obvious. In the below, I'm using Spark 1.4. I've created a partitioned table in S3 (call it 'dataset'), with basic structure like so: s3://bucket/dataset/pk=a s3://bucket/dataset/pk=b s3://bucket/dataset/pk=c
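The layout above is standard Hive-style partitioning. A hedged sketch of two ways to touch only one partition (assuming the partitions hold Parquet files; bucket and partition names are from the post):

    // read a single partition directory directly, skipping the others
    val onePart = sqlContext.read.parquet("s3://bucket/dataset/pk=a")

    // or query the registered table and rely on partition pruning
    val pruned = sqlContext.table("dataset").filter("pk = 'a'")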

Re: All masters are unresponsive issue

2015-07-04 Thread Ted Yu
Currently the number of retries is hardcoded. You may want to open a JIRA which makes the retry count configurable. Cheers On Thu, Jul 2, 2015 at 8:35 PM, luohui20...@sina.com wrote: Hi there, I checked the source code and found that in org.apache.spark.deploy.client.AppClient, there
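For reference, the hardcoded values in question look roughly like this in Spark 1.4's AppClient (quoted approximately, so treat as a sketch rather than exact source):

    // org.apache.spark.deploy.client.AppClient (Spark 1.4, approximate)
    private val REGISTRATION_TIMEOUT = 20.seconds
    private val REGISTRATION_RETRIES = 3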

Re: text file stream to HDFS

2015-07-04 Thread Ted Yu
Please take a look at streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala def saveAsHadoopFiles[F <: OutputFormat[K, V]]( prefix: String, suffix: String )(implicit fm: ClassTag[F]): Unit = ssc.withScope { Cheers On Sat, Jul 4, 2015 at 5:23
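A hedged usage sketch: saveAsHadoopFiles is defined on pair DStreams only, hence the (line, "") mapping below; the output prefix and suffix are hypothetical.

    import org.apache.hadoop.mapred.TextOutputFormat

    // convert to a pair DStream so PairDStreamFunctions applies
    val pairs = lines.map(line => (line, ""))
    pairs.saveAsHadoopFiles[TextOutputFormat[String, String]](
      "hdfs:///user/hadoop/output/part", "txt")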

Splitting dataframe using Spark 1.4 for nested json input

2015-07-04 Thread Mike Tracy
Hello, I am having issues with splitting the contents of a dataframe column using Spark 1.4. The dataframe was created by reading a nested complex json file. I used df.explode but keep getting an error message. scala> val df = sqlContext.read.json("/Users/xx/target/statsfile.json") scala> df.show()
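For reference, a minimal sketch of the Spark 1.4 explode API on a string column (the column names here are hypothetical; exploding a nested array column follows the same shape but passes Column arguments):

    // one input row becomes one output row per token in "tags"
    val exploded = df.explode("tags", "tag") { tags: String => tags.split(",") }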

Restarting Spark Streaming Application with new code

2015-07-04 Thread Vinoth Chandar
Hi, Just looking for some clarity on the 1.4 documentation below. "And restarting from earlier checkpoint information of pre-upgrade code cannot be done. The checkpoint information essentially contains serialized Scala/Java/Python objects and trying to deserialize objects with new, modified
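What the docs recommend around that passage, as a minimal sketch (the checkpoint path and the createContext function are hypothetical):

    // in the old application: drain in-flight batches before shutting down
    ssc.stop(stopSparkContext = true, stopGracefully = true)

    // in the upgraded application: start from a fresh checkpoint directory so
    // the pre-upgrade serialized objects are never deserialized
    val newSsc = StreamingContext.getOrCreate(
      "hdfs:///checkpoints/app-v2", () => createContext())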

RE: Feature Generation On Spark

2015-07-04 Thread rishikesh thakur
Hi, thanks, I guess this will solve my problem. I will load multiple files using wildcards like *.csv. I guess if I use wholeTextFiles instead of textFile, I will get the whole file contents as the value, which will in turn ensure one feature vector per file. Thanks, Nitin Date: Sat, 4 Jul 2015 09:37:52

RE: Feature Generation On Spark

2015-07-04 Thread rishikesh thakur
I have one document per file and each file is to be converted to a feature vector. Pretty much like standard feature construction for document classification. Thanks, Rishi Date: Sun, 5 Jul 2015 01:44:04 +1000 Subject: Re: Feature Generation On Spark From: guha.a...@gmail.com To:
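Under that one-document-per-file assumption, a minimal sketch of the pipeline (whitespace tokenization, the path, and the feature count are assumptions; HashingTF is from MLlib):

    import org.apache.spark.mllib.feature.HashingTF

    // one (fileName, fullText) pair per file -> one feature vector per file
    val docs = sc.wholeTextFiles("hdfs:///data/docs/*.csv")
    val tf = new HashingTF(numFeatures = 10000)
    val vectors = docs.map { case (file, text) =>
      (file, tf.transform(text.split("\\s+").toSeq))
    }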

JDBC Streams

2015-07-04 Thread ayan guha
Hi All, I have a requirement to connect to a DB every few minutes and bring data into HBase. Can anyone suggest whether Spark Streaming would be appropriate for this scenario, or should I look into jobserver? Thanks in advance -- Best Regards, Ayan Guha
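For a fetch every few minutes, a plain scheduled batch job may be simpler than streaming; a hedged sketch of the JDBC read side (URL, table, and credentials are hypothetical, and writing to HBase would go through its own client API):

    val props = new java.util.Properties()
    props.setProperty("user", "appuser")
    props.setProperty("password", "secret")
    // reads the table into a DataFrame; run this on a schedule
    val df = sqlContext.read.jdbc("jdbc:mysql://dbhost:3306/mydb", "events", props)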

Re: mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml

2015-07-04 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTt4CqUGAvnPj2/Spark+master+build&subj=Re+Can+not+build+master On Jul 4, 2015, at 9:44 PM, Alec Taylor alec.tayl...@gmail.com wrote: Running: `build/mvn -DskipTests clean package` on Ubuntu 15.04 (amd64, 3.19.0-21-generic) with Apache Maven

Re: mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml

2015-07-04 Thread Alec Taylor
Thanks, will just build from spark-1.4.0.tgz in the meantime. On Sun, Jul 5, 2015 at 2:52 PM, Ted Yu yuzhih...@gmail.com wrote: See this thread: http://search-hadoop.com/m/q3RTt4CqUGAvnPj2/Spark+master+build&subj=Re+Can+not+build+master On Jul 4, 2015, at 9:44 PM, Alec Taylor

Spark got stuck with BlockManager after computing connected components using GraphX

2015-07-04 Thread Hellen
I'm computing connected components using Spark GraphX on AWS EC2. I believe the computation was successful, as I saw the type information of the final result. However, it looks like Spark was doing some cleanup. The BlockManager removed a bunch of blocks and got stuck at 15/07/04 21:53:06 INFO

text file stream to HDFS

2015-07-04 Thread ravi tella
Hello, How should I write a text file stream DStream to HDFS? I tried the following: val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input/") lines.saveAsTextFile("hdfs:/user/hadoop/output1") val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input/")
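A hedged sketch of the fix: on a DStream the method is saveAsTextFiles (plural), which writes one output directory per batch interval; saveAsTextFile is the RDD method and does not exist on DStream. Paths are carried over from the post:

    val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input/")
    // prefix + batch time + suffix becomes the per-batch output directory
    lines.saveAsTextFiles("hdfs:/user/hadoop/output1/lines", "txt")
    ssc.start()
    ssc.awaitTermination()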

mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml

2015-07-04 Thread Alec Taylor
Running: `build/mvn -DskipTests clean package` on Ubuntu 15.04 (amd64, 3.19.0-21-generic) with Apache Maven 3.3.3 starts to build fine, then just keeps outputting these lines: [INFO] Dependency-reduced POM written at: /spark/bagel/dependency-reduced-pom.xml I've kept it running for an hour. How

Re: Are Spark Streaming RDDs always processed in order?

2015-07-04 Thread Michal Čizmazia
I had a similar inquiry, copied below. I was also looking into making an SQS Receiver reliable: http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming Hope this helps. -- Forwarded message -- From: Tathagata Das t...@databricks.com Date: 20 June

Feature Generation On Spark

2015-07-04 Thread rishikesh
Hi, I am new to Spark and am working on document classification. Before model fitting I need to do feature generation: each document is to be converted to a feature vector. However, I am not sure how to do that. While testing locally I have a static list of tokens and when I parse a file I do a