Re: Stop Cluster Mode Running App

2015-05-08 Thread James King
Many thanks Silvio. Someone also suggested using something similar: ./bin/spark-class org.apache.spark.deploy.Client kill <master-url> <driver-ID> Regards jk On Fri, May 8, 2015 at 2:12 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi James, If you’re on Spark 1.3 you can use

RE: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Shao, Saisai
I think you could use checkpoint to cut the lineage of `MyRDD`; I have a similar scenario and I use checkpoint to work around this problem :) Thanks Jerry -Original Message- From: yaochunnan [mailto:yaochun...@gmail.com] Sent: Friday, May 8, 2015 1:57 PM To: user@spark.apache.org
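A minimal sketch of that workaround, assuming an HDFS checkpoint directory (the path is illustrative; myRDD stands for the long-lived RDD from the thread):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // must be fault-tolerant storage
    myRDD.checkpoint()  // marks the RDD; its lineage is truncated once it has been materialized
    myRDD.count()       // any action forces materialization and writes the checkpoint files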

Re: Mismatch when using sparkSQL to insert data into a hive table with dynamic partition DateType

2015-05-08 Thread Gerald-G
Resolved by creating a UDF to match the different formats of DateType in sparksql and HIVE; suggest fixing it in the next release On Fri, May 8, 2015 at 11:07 AM, Gerald-G shadowinl...@gmail.com wrote: Hi: Spark version is 1.3.1 I used sparksql to insert data into a hive table with a DateType partition The
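For illustration, a hedged sketch of the kind of UDF described; the source and target date formats are assumptions, not taken from the thread:

    import java.text.SimpleDateFormat

    // Normalize a partition date string (e.g. "2015/05/08") into the yyyy-MM-dd
    // form expected on the Hive side before inserting.
    sqlContext.udf.register("to_hive_date", (s: String) => {
      val parsed = new SimpleDateFormat("yyyy/MM/dd").parse(s)
      new SimpleDateFormat("yyyy-MM-dd").format(parsed)
    })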

Re: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Chunnan Yao
Thank you for this suggestion! But may I ask what's the advantage of using checkpoint instead of cache here? Because they both cut lineage. I only know checkpoint saves the RDD to disk, while cache keeps it in memory. So maybe it's for reliability? Also on

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread Akhil Das
Have a look at this SO http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application question, it has discussion on various ways of accessing S3. Thanks Best Regards On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote: Hi
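One common approach covered in such discussions is to put the credentials on the Hadoop configuration before reading (a sketch with placeholder keys, assuming the s3n filesystem):

    // sc is the existing SparkContext
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      // placeholder
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  // placeholder
    val lines = sc.textFile("s3n://your-bucket/some/path/")                     // placeholder bucket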

updateStateByKey - how to generate a stream of state changes?

2015-05-08 Thread mini saw
Imagine an input stream transformed by updateStateByKey, based on some state. As an output of the transformation, I would like to have a stream of state changes only - not the stream of states themselves. What is the natural way of obtaining such a stream?
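One possible sketch (not from the thread): keep both the previous and the current value in the state, then filter down to keys whose value changed in this batch. The pair-of-longs state shape is an assumption, and events is assumed to be a DStream[(String, Long)]:

    // state = (previousValue, currentValue); emit only keys where the two differ
    val states = events.updateStateByKey[(Long, Long)] { (values: Seq[Long], state: Option[(Long, Long)]) =>
      val previous = state.map(_._2).getOrElse(0L)
      val current  = previous + values.sum
      Some((previous, current))
    }
    val stateChanges = states.filter { case (_, (previous, current)) => previous != current }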

RE: The explanation of input text format using LDA in Spark

2015-05-08 Thread Yang, Yuhao
Hi Cui, Try to read the scala version of LDAExample, https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala The matrix you're referring to is the corpus after vectorization. One example, given a dict, [apple, orange, banana] 3
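To make that concrete, a small sketch (dictionary and documents are illustrative): with the dictionary [apple, orange, banana], the document "apple apple banana" vectorizes to the term-count vector (2.0, 0.0, 1.0), and LDA consumes an RDD of (documentId, termCountVector) pairs.

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val docs = Seq(Vectors.dense(2.0, 0.0, 1.0),   // "apple apple banana"
                   Vectors.dense(0.0, 1.0, 1.0))   // "orange banana"
    val corpus = sc.parallelize(docs).zipWithIndex.map { case (vec, id) => (id, vec) }
    val ldaModel = new LDA().setK(2).run(corpus)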

Re: Master node memory usage question

2015-05-08 Thread Akhil Das
What's your use case and what are you trying to achieve? Maybe there's a better way of doing it. Thanks Best Regards On Fri, May 8, 2015 at 10:20 AM, Richard Alex Hofer rho...@andrew.cmu.edu wrote: Hi, I'm working on a project in Spark and am trying to understand what's going on. Right now to

SparkStreaming + Flume/PDI+Kafka

2015-05-08 Thread GARCIA MIGUEL, DAVID
Hi! I've been using spark for the last months and it is awesome. I'm pretty new on this topic so don't be too harsh on me. Recently I've been doing some simple tests with Spark Streaming for log processing and I'm considering different ETL input solutions such as Flume or PDI+Kafka. My use

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a scala case class, it's working (either compare as String or Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code.
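A minimal repro sketch of the failing path described, with the schema built by hand and applied to a row RDD (column names and values are illustrative):

    import java.sql.Date
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, DateType}

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("d", DateType)))
    val rows = sc.parallelize(Seq(Row("a", Date.valueOf("2015-05-01"))))
    val df = sqlContext.createDataFrame(rows, schema)
    df.filter(df("d") >= "2015-05-01").show()   // the comparison that reportedly fails on this path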

RE: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Shao, Saisai
IIUC only checkpoint will clean the lineage information; cache will not cut the lineage. Also checkpoint will put the data in HDFS, not on local disk :) I think you can use foreachRDD to do such RDD update work; it looks OK as far as I can tell from your code snippet. From: Chunnan Yao
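A rough sketch of that foreachRDD update pattern (names and the merge logic are assumptions):

    // myRDD is the long-lived RDD of (key, count) pairs maintained across batches;
    // dstream is assumed to be a DStream[(String, Long)]
    var myRDD = sc.parallelize(Seq.empty[(String, Long)])

    dstream.foreachRDD { batchRDD =>
      myRDD = myRDD.union(batchRDD).reduceByKey(_ + _)
      myRDD.checkpoint()   // written to the HDFS checkpoint dir, cutting the growing union chain
      myRDD.count()        // an action so the checkpoint actually runs for this batch
    }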

Re: (无主题)

2015-05-08 Thread Akhil Das
Since it's loading 24 records, it could be that your CSV is corrupted (maybe the newline char isn't \n but \r\n if it comes from a Windows environment; you can check this with *cat -v yourcsvfile.csv | more*). Thanks Best Regards On Fri, May 8, 2015 at 11:23 AM, luohui20...@sina.com wrote:
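If the file does turn out to have Windows line endings, a hedged one-liner on the Spark side removes the stray carriage returns:

    // textFile splits on \n, which leaves a trailing \r on each line of a CRLF file
    val cleaned = sc.textFile("yourcsvfile.csv").map(_.stripSuffix("\r"))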

Re: Getting data into Spark Streaming

2015-05-08 Thread Akhil Das
I don't think you can use rawSocketStream since the RSVP is from a web server and you will have to send a GET request first to initialize the communication. You are better off writing a custom receiver https://spark.apache.org/docs/latest/streaming-custom-receivers.html for your usecase. For a
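A minimal custom-receiver sketch along the lines of the linked guide (the URL, polling interval, and storage level are assumptions):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class HttpPollingReceiver(url: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("HTTP polling receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              val body = scala.io.Source.fromURL(url).mkString  // issues the GET request
              store(body)                                       // hands the data to Spark Streaming
              Thread.sleep(5000)
            }
          }
        }.start()
      }

      def onStop(): Unit = {}   // the polling loop exits once isStopped() becomes true
    }

    // usage: val rsvps = ssc.receiverStream(new HttpPollingReceiver("http://example.com/rsvps"))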

Re: YARN mode startup takes too long (10+ secs)

2015-05-08 Thread Zoltán Zvara
So does this sleep occur before resources are allocated for the first few executors to start the job? On Fri, May 8, 2015 at 6:23 AM Taeyun Kim taeyun@innowireless.com wrote: I think I’ve found the (maybe partial, but major) reason. It’s between the following lines, (it’s newly captured,

updateStateByKey - how to generate a stream of state changes?

2015-05-08 Thread minisaw
Imagine an input stream transformed by updateStateByKey, based on some state. As an output of the transformation, I would like to have a stream of state changes only - not the stream of states themselves. What is the natural way of obtaining such a stream?

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread ayan guha
From S3, as the dependency of df will be on S3 and because RDDs are not replicated. On 8 May 2015 23:02, Peter Rudenko petro.rude...@gmail.com wrote: Hi, I have a question: val data = sc.textFile(s3:///); val df = data.toDF; df.saveAsParquetFile(hdfs://); df.someAction(...) if during

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hm, thanks. Do you know what this setting means: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1178 ? Thanks, Peter Rudenko On 2015-05-08 17:48, ayan guha wrote: From S3, as the dependency of df will be on S3 and because RDDs are

Re: Virtualenv pyspark

2015-05-08 Thread Nicholas Chammas
This is an interesting question. I don't have a solution for you, but you may be interested in taking a look at Anaconda Cluster http://continuum.io/anaconda-cluster. It's made by the same people behind Conda (an alternative to pip focused on data science packages) and may offer a better way of

Cluster mode and supervised app with multiple Masters

2015-05-08 Thread James King
Why does this not work: ./spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class SomeApp --deploy-mode cluster --supervise --master spark://host01:7077,host02:7077 Some.jar It fails with the exception: Caused by: java.lang.NumberFormatException: For input string: 7077,host02:7077 It seems to accept only one

filterRDD and flatMap

2015-05-08 Thread hmaeda
Dear Usergroup, I am struggling to use the SparkR package that comes with Apache Spark 1.4.0. I am having trouble getting the tutorials in the original amplabs-extra/SparkR-pkg working. Please see my stackoverflow question with a bounty for more details...here

Submit Spark application in cluster mode and supervised

2015-05-08 Thread James King
I have two hosts host01 and host02 (lets call them) I run one Master and two Workers on host01 I also run one Master and two Workers on host02 Now I have 1 LIVE Master on host01 and a STANDBY Master on host02 The LIVE Master is aware of all Workers in the cluster Now I submit a Spark

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread James King
BTW I'm using Spark 1.3.0. Thanks On Fri, May 8, 2015 at 5:22 PM, James King jakwebin...@gmail.com wrote: I have two hosts host01 and host02 (lets call them) I run one Master and two Workers on host01 I also run one Master and two Workers on host02 Now I have 1 LIVE Master on host01 and a

dependencies on java-netlib and jblas

2015-05-08 Thread John Niekrasz
Newbie question... Can I use any of the main ML capabilities of MLlib in a Java-only environment, without any native library dependencies? According to the documentation, java-netlib provides a JVM fallback. This suggests that native netlib libraries are not required. It appears that such a

Re: spark-shell breaks for scala 2.11 (with yarn)?

2015-05-08 Thread Koert Kuipers
I searched the JIRAs but couldn't find any recent mention of this. Let me try with the 1.4.0 branch and see if it goes away... On Wed, May 6, 2015 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote: hello all, I built spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and scala 2.11. I am

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sonal Goyal
Hi John, I have been using MLLIB without installing the jblas native dependency. Functionally I have not got stuck. I still need to explore if there are any performance hits. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, May 8,

parallelism on binary file

2015-05-08 Thread tog
Hi, I have an application that currently runs using MR. It starts by extracting information from a proprietary binary file that is copied to HDFS. The application starts by creating business objects from information extracted from the binary files. Later those objects are used for further

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Michael Armbrust
That's a feature flag for a new code path for reading parquet files. It's only there in case bugs are found in the old path and will be removed once we are sure the new path is solid. On Fri, May 8, 2015 at 8:04 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hm, thanks. Do you know what this

Cassandra number of Tasks

2015-05-08 Thread Vijay Pawnarkar
I am using the Spark Cassandra connector to work with a table with 3 million records. Using the .where() API to work with only certain rows in this table. The where clause filters the data to 1 rows. CassandraJavaUtil.javaFunctions(sparkContext) .cassandraTable(KEY_SPACE, MY_TABLE,

Spark Cassandra connector number of Tasks

2015-05-08 Thread vijaypawnarkar
I am using the Spark Cassandra connector to work with a table with 3 million records. Using the .where() API to work with only certain rows in this table. The where clause filters the data to 1 rows. CassandraJavaUtil.javaFunctions(sparkContext) .cassandraTable(KEY_SPACE, MY_TABLE,

Re: Map one RDD into two RDD

2015-05-08 Thread anshu shukla
Any update to the above mail? And can anyone tell me the logic - I have to filter tweets and submit tweets with a particular #hashtag1 to SparkSQL databases, and tweets with #hashtag2 will be passed to a sentiment analysis phase. The problem is how to split the input data into two streams using hashtags On

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread Silvio Fiorito
If you’re using multiple masters with ZooKeeper then you should set your master URL to be spark://host01:7077,host02:7077 And the property spark.deploy.recoveryMode=ZOOKEEPER See here for more info: http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
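On the application side, the whole list of masters goes into a single spark:// URL (a sketch; the hosts are the ones from the thread). The recovery settings themselves are configured on the master daemons (typically via SPARK_DAEMON_JAVA_OPTS), not in the application:

    import org.apache.spark.SparkConf

    // one spark:// prefix, hosts separated by commas
    val conf = new SparkConf()
      .setAppName("SomeApp")
      .setMaster("spark://host01:7077,host02:7077")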

Re: [SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Michael Armbrust
What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can reproduce the same bug in master, can you open a JIRA? On Fri, May 8, 2015 at 1:36 AM, Haopu Wang hw...@qilinsoft.com

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sean Owen
Yes, at this point I believe you'll find jblas used for historical reasons, to not change some APIs. I don't believe it's used for much if any computation in 1.4. On May 8, 2015 5:04 PM, John Niekrasz john.niekr...@gmail.com wrote: Newbie question... Can I use any of the main ML capabilities

Lambda architecture using Apache Spark

2015-05-08 Thread rafac
I am implementing the lambda architecture using apache spark for both streaming and batch processing. For real-time queries I'm using spark streaming with cassandra, and for batch queries I am using spark sql and spark mllib. The problem I'm facing now is: I need to implement one serving layer,

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

RE: Lambda architecture using Apache Spark

2015-05-08 Thread Mohammed Guller
Why are you not using Cassandra for storing the pre-computed views? Mohammed -Original Message- From: rafac [mailto:rafaelme...@hotmail.com] Sent: Friday, May 8, 2015 1:48 PM To: user@spark.apache.org Subject: Lambda architecture using Apache Spark I am implementing the lambda

Spark streaming updating a large window more frequently

2015-05-08 Thread Ankur Chauhan
Hi, I am pretty new to spark/spark_streaming so please excuse my naivety. I have a timestamped streaming event stream and I would like to aggregate it into, let's say, hourly buckets. Now the simple answer is to use a window operation with window length of 1 hr and sliding interval of

RE: Spark streaming updating a large window more frequently

2015-05-08 Thread Mohammed Guller
If I understand you correctly, you need Window duration of 1 hour and sliding interval of 5 seconds. Mohammed -Original Message- From: Ankur Chauhan [mailto:achau...@brightcove.com] Sent: Friday, May 8, 2015 2:27 PM To: u...@spark.incubator.apache.org Subject: Spark streaming
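A hedged sketch of that combination; the key/count shape of the stream is an assumption. The inverse-reduce variant keeps a one-hour window cheap to re-evaluate every five seconds, at the cost of requiring checkpointing:

    import org.apache.spark.streaming.{Minutes, Seconds}

    // events is assumed to be a DStream of (key, 1L) pairs built from the raw stream
    val hourlyCounts = events.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // add the new 5-second slice
      (a: Long, b: Long) => a - b,   // subtract the slice that just left the 1-hour window
      Minutes(60), Seconds(5))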

CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Carlos Pereira
Hi, I would like to create a hive table on top of an existing parquet file as described here: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html Due to network restrictions, I need to store the metadata definition in a different path than '/user/hive/warehouse', so I
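For reference, a sketch of the kind of statement under discussion, assuming a HiveContext (names and paths are placeholders); the thread reports that the database qualifier is currently ignored for this form:

    sqlContext.sql("""
      CREATE TABLE mydb.events
      USING org.apache.spark.sql.parquet
      OPTIONS (path 'hdfs:///custom/warehouse/events')
    """)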

[SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hi, I have a question: val data = sc.textFile(s3:///); val df = data.toDF; df.saveAsParquetFile(hdfs://); df.someAction(...) If during someAction some workers would die, would recomputation download files from s3 or from hdfs parquet? Thanks, Peter Rudenko
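If the goal is for recomputation to start from the HDFS copy rather than S3, one hedged option is to read the saved parquet back and run the action on that DataFrame (paths are placeholders):

    df.saveAsParquetFile("hdfs:///data/df.parquet")
    val dfFromHdfs = sqlContext.parquetFile("hdfs:///data/df.parquet")
    dfFromHdfs.count()   // stands in for someAction; its lineage now starts at the HDFS parquet files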

Re: Spark SQL and Hive interoperability

2015-05-08 Thread jdavidmitchell
So, why isn't this comment/question being posted to the list? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-Hive-interoperability-tp22690p22827.html

Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread Yin Huai
Can you attach the full stack trace? Thanks, Yin On Fri, May 8, 2015 at 4:44 PM, barmaley o...@solver.com wrote: Given a registered table from data frame, I'm able to execute queries like sqlContext.sql(SELECT STDDEV(col1) FROM table) from Spark Shell just fine. However, when I run exactly

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Chris Fregly
hey mike- as you pointed out here from my docs, changing the stream name is sometimes problematic due to the way the Kinesis Client Library manages leases and checkpoints, etc in DynamoDB. I noticed this directly while developing the Kinesis connector which is why I highlighted the issue

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Carlos Pereira
Thanks Michael for the quick reply. I was looking forward to the automatic schema inference (I think that's what you mean by 'schema merging'?), and I think the STORED AS would still require me to define the table columns, right? Anyway, I am glad to hear you guys are already working to fix this in future

Re: Hash Partitioning and Dataframes

2015-05-08 Thread Michael Armbrust
What are you trying to accomplish? Internally Spark SQL will add Exchange operators to make sure that data is partitioned correctly for joins and aggregations. If you are going to do other RDD operations on the result of dataframe operations and you need to manually control the partitioning,
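If you do need RDD-level control afterwards, a hedged sketch of the manual route (the join key is assumed to be the first column of both DataFrames, and 200 partitions is arbitrary):

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(200)
    val keyedA = dfA.rdd.map(row => (row.getString(0), row)).partitionBy(partitioner)
    val keyedB = dfB.rdd.map(row => (row.getString(0), row)).partitionBy(partitioner)
    val joined = keyedA.join(keyedB)   // both sides share a partitioner, so no further shuffle is needed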

RE: Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
Just trying to make sure that something I know in advance (the joins will always have an equality test on one specific field) is used to optimize the partitioning so the joins only use local data. Thanks for the info. Ron From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday,

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Michael Armbrust
Actually, I was talking about the support for inferring different but compatible schemata from various files, automatically merging them into a single schema. However, you are right that I think you need to specify the columns / types if you create it as a Hive table. On Fri, May 8, 2015 at 3:11

Re: Unable to join table across data sources using sparkSQL

2015-05-08 Thread Ishwardeep Singh
Finally got it working. I was trying to access hive using the JDBC driver in the same way I was accessing Teradata. It took me some time to figure out that the default sqlContext created by Spark supports hive and uses the hive-site.xml in the spark conf folder to access hive. I had to use my
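For reference, a hedged sketch of that route: a HiveContext picks up hive-site.xml from the Spark conf directory, so metastore tables become queryable without a separate JDBC connection (the table name is a placeholder):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.sql("SELECT * FROM some_hive_table")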

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread in4maniac
Hi guys... I realised that it was a bug in my code that caused it to break. I was running the filter on a SchemaRDD when I was supposed to be running it on an RDD. But I still don't understand why the stderr was about an S3 request rather than a type checking error such as No tuple position

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread ayan guha
I am just wondering if CREATE TABLE supports the syntax CREATE TABLE db.tablename instead of the two-step process of USE db and then CREATE TABLE tablename? On 9 May 2015 08:17, Michael Armbrust mich...@databricks.com wrote: Actually, I was talking about the support for inferring different but

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hey Chris! I was happy to see the documentation outlining that issue :-) However, I must have got into a pretty terrible state because I had to delete and recreate the kinesis streams as well as the DynamoDB tables. Thanks for the reply, everything is sorted. Mike On Fri, May 8, 2015 at

Re: Map one RDD into two RDD

2015-05-08 Thread ayan guha
Do as Evo suggested. Rdd1=rdd.filter, rdd2=rdd.filter On 9 May 2015 05:19, anshu shukla anshushuk...@gmail.com wrote: Any update to above mail and Can anyone tell me logic - I have to filter tweets and submit tweets with particular #hashtag1 to SparkSQL databases and tweets with
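A minimal sketch of that two-filter split (hashtag matching is simplified and tweets is assumed to be a DStream[String]):

    val hashtag1Tweets = tweets.filter(_.contains("#hashtag1"))  // route to the Spark SQL sink
    val hashtag2Tweets = tweets.filter(_.contains("#hashtag2"))  // route to sentiment analysis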

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
- [Kinesis stream name]: The Kinesis stream that this streaming application receives from
- The application name used in the streaming context becomes the Kinesis application name
- The application name must be unique for a given account and region.
- The Kinesis

Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread barmaley
Given a table registered from a data frame, I'm able to execute queries like sqlContext.sql(SELECT STDDEV(col1) FROM table) from the Spark Shell just fine. However, when I run exactly the same code in a standalone app on a cluster, it throws an exception: java.util.NoSuchElementException: key not found:

Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hi All, I am submitting the assembled fat jar file by the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar --class com.xxx.Consumer -0.1-SNAPSHOT.jar It reads the data file from kinesis using the stream name defined in a configuration file. It turns out that it