Re: SparkStreaming batch processing time question

2015-04-01 Thread Akhil Das
It will add scheduling delay for the new batch. The new batch's data will be processed only after the previous batch finishes; when that delay grows too high, it can sometimes throw fetch failures because the batch data may already have been removed from memory. Thanks Best Regards On Wed, Apr 1, 2015 at 11:35 AM,

Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-01 Thread twinkle sachdeva
Hi, In Spark over YARN, there is a property spark.yarn.max.executor.failures which controls the maximum number of executor failures an application will survive. If the number of executor failures (due to any reason, like OOM or machine failure, etc.) exceeds this value, then the application quits.
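A minimal Scala sketch of how this limit is typically raised for a long-running application (the app name and the value 100 are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the tolerated number of executor failures so a long-running
// YARN application survives occasional OOMs or machine loss.
val conf = new SparkConf()
  .setAppName("long-running-yarn-app")
  .set("spark.yarn.max.executor.failures", "100")
val sc = new SparkContext(conf)

The same property can also be passed on the spark-submit command line with --conf.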

RE: Using 'fair' scheduler mode with thrift server

2015-04-01 Thread Judy Nash
The expensive query can take all executor slots, but no task occupies an executor permanently, i.e. the second job could possibly take some resources to execute in between tasks of the expensive queries. Can the fair scheduler mode help in this case? Or is it possible to setup thrift such that

SparkStreaming batch processing time question

2015-04-01 Thread luohui20001
hi guys: I got a question when reading http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval. What will happen to the streaming data if the batch processing time is bigger than the batch interval? Will the next batch data be

Re: Query REST web service with Spark?

2015-04-01 Thread Emre Sevinc
Hello Minnow, It is possible. You can for example use Jersey REST client to query a web service and get its results in a Spark job. In fact, that's what we did actually in a recent project (in a Spark Streaming application). Kind regards, Emre Sevinç http://www.bigindustries.be/ On Tue, Mar

Re: rdd.cache() not working ?

2015-04-01 Thread Taotao.Li
Rerun person.count and you will see the performance of cache. person.cache will not cache it right away; it will actually cache this RDD only after one action [person.count here]. - Original Message - From: fightf...@163.com To: user user@spark.apache.org Sent: Wednesday, April 1, 2015 1:21:25 PM Subject:
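A minimal spark-shell sketch of this laziness, following the Person example from the thread (path and case class are placeholders):

// cache() is lazy: it only marks the RDD's storage level. The data is
// materialized (and shows up under the web UI's Storage tab) only after
// the first action runs.
case class Person(id: Int, name: String)

val person = sc.textFile("hdfs://namenode_host:8020/user/person.txt")
  .map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1)))

person.cache()   // no work happens yet
person.count()   // first action: scans HDFS and fills the cache
person.count()   // second action: served from memory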

Re: Using 'fair' scheduler mode

2015-04-01 Thread Raghavendra Pandey
I am facing the same issue. FAIR and FIFO behaving in the same way. On Wed, Apr 1, 2015 at 1:49 AM, asadrao as...@microsoft.com wrote: Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the first query is a very expensive query (ex: ‘select *’ on a really big data set)

Re: When do map how to get the line number?

2015-04-01 Thread jitesh129
You can use zipWithIndex() to get an index for each record and then increment each index by 1. val tf = sc.textFile("test").zipWithIndex() tf.map(s => (s._2 + 1, s._1)) The above should serve your purpose. -- View this message in context:

Re: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Akhil Das
Once you submit the job, do a ps aux | grep spark-submit and see how much heap space is allocated to the process (the -Xmx params). If you are seeing a lower value, you could try increasing it yourself with: export _JAVA_OPTIONS=-Xmx5g Thanks Best Regards On Wed, Apr 1, 2015 at 1:57 AM, Shuai

Reply: Re: SparkStreaming batch processing time question

2015-04-01 Thread luohui20001
Hmmm, got it. Thank you Akhil. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: SparkStreaming batch processing time question Date: 2015-04-01

Re: Re: rdd.cache() not working ?

2015-04-01 Thread Sean Owen
No, cache() changes the bookkeeping of the existing RDD. Although it returns a reference, it works to just call person.cache. I can't reproduce this. When I try to cache an RDD and then count it, it is persisted in memory and I see it in the web UI. Something else must be different about what's

RE: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Felix Cheung
This is tracked by these JIRAs.. https://issues.apache.org/jira/browse/SPARK-5947 https://issues.apache.org/jira/browse/SPARK-5948 From: denny.g@gmail.com Date: Wed, 1 Apr 2015 04:35:08 + Subject: Creating Partitioned Parquet Tables via SparkSQL To: user@spark.apache.org Creating

Spark + Kafka

2015-04-01 Thread James King
I have a simple setup/runtime of Kafka and Spark. I have a command line consumer displaying arrivals to a Kafka topic, so I know messages are being received. But when I try to read from the Kafka topic I get no messages; here are some logs below. I'm thinking there aren't enough threads. How do I

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-01 Thread Jaonary Rabarisoa
Hmm, I got the same error with the master. Here is another test example that fails. Here, I explicitly create a Row RDD which corresponds to the use case I am in: object TestDataFrame { def main(args: Array[String]): Unit = { val conf = new

Re: Using 'fair' scheduler mode

2015-04-01 Thread Mark Hamstra
I am using the Spark ‘fair’ scheduler mode. What do you mean by this? Fair scheduling mode is not one thing in Spark, but allows for multiple configurations and usages. Presumably, at a minimum you are using SparkConf to set spark.scheduler.mode to FAIR, but then how are you setting up
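A minimal Scala sketch of one such configuration (the pool name and allocation file path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Enable FAIR scheduling and optionally point at a pool definition file.
val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread are assigned to the named pool.
sc.setLocalProperty("spark.scheduler.pool", "adhoc")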

Re: Re: rdd.cache() not working ?

2015-04-01 Thread Yuri Makhno
cache() method returns new RDD so you have to use something like this: val person = sc.textFile("hdfs://namenode_host:8020/user/person.txt").map(_.split(",")).map(p => Person(p(0).trim.toInt, p(1))) val cached = person.cache cached.count when you rerun count on cached you will see that cache

Re: Re: rdd.cache() not working ?

2015-04-01 Thread fightf...@163.com
Hi Still no good luck with your guide. Best. Sun. fightf...@163.com From: Yuri Makhno Date: 2015-04-01 15:26 To: fightf...@163.com CC: Taotao.Li; user Subject: Re: Re: rdd.cache() not working ? cache() method returns new RDD so you have to use something like this: val person =

Re: Spark + Kafka

2015-04-01 Thread bit1...@163.com
Please make sure that you have given more cores than the number of Receivers. From: James King Date: 2015-04-01 15:21 To: user Subject: Spark + Kafka I have a simple setup/runtime of Kafka and Spark. I have a command line consumer displaying arrivals to a Kafka topic. So I know messages are being
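A minimal Scala sketch of a receiver-based Kafka stream where this matters (ZooKeeper quorum, group id and topic name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// A receiver permanently occupies one core, so at least one more core is
// needed for processing -- local[1] would receive data but never process it.
val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-receiver-demo")
val ssc = new StreamingContext(conf, Seconds(10))

val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "demo-group", Map("test" -> 1))
messages.map(_._2).print()

ssc.start()
ssc.awaitTermination()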

Re: Spark + Kafka

2015-04-01 Thread James King
Thanks Saisai, sure will do. But just a quick note that when I set master as local[*], Spark started showing Kafka messages as expected, so the problem in my view was to do with not having enough threads to process the incoming data. Thanks. On Wed, Apr 1, 2015 at 10:53 AM, Saisai Shao

Re: Disable stage logging to stdout

2015-04-01 Thread Sean Owen
You can disable with spark.ui.showConsoleProgress=false but I also wasn't sure why this writes straight to System.err, at first. I assume it's because it's writing carriage returns to achieve the animation and this won't work via a logging framework. stderr is where log-like output goes, because
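A minimal Scala sketch of setting this flag programmatically (the app name is a placeholder; the same flag can be passed to spark-submit with --conf):

import org.apache.spark.{SparkConf, SparkContext}

// Turn off the in-place stage progress bar that is written to stderr.
val conf = new SparkConf()
  .setAppName("quiet-console")
  .set("spark.ui.showConsoleProgress", "false")
val sc = new SparkContext(conf)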

Re: Spark + Kafka

2015-04-01 Thread Saisai Shao
Would you please share your code snippet, so we can identify whether there is anything wrong in your code? Besides, would you please grep your driver's debug log to see if there's any debug log like Stream xxx received block xxx; this means that Spark Streaming is receiving data from

Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
Since switching to Spark 1.2.1 I'm seeing logging for the stage progress (ex.): [error] [Stage 2154: (14 + 8) / 48][Stage 2210: (0 + 0) / 48] Any reason why these are error level logs? Shouldn't they be info level? In any case is there a way to disable them other than

RE: Spark + Kafka

2015-04-01 Thread Shao, Saisai
OK, seems there's nothing strange in your code. So maybe we need to narrow down the cause. Would you please run the KafkaWordCount example in Spark to see if it is OK? If it is OK, then we should focus on your implementation; otherwise Kafka potentially has some problems. Thanks Jerry From:

Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread fergjo00
Question: --- Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it sees the names correctly in the schema created but does not escape them in the SQL it creates when they are not compliant: org.apache.spark.sql.jdbc.JDBCRDD private
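For reference, a minimal Scala sketch of the Spark 1.3 JDBC data source being discussed (URL and table name are placeholders):

import org.apache.spark.sql.SQLContext

// Columns whose names need quoting (e.g. mixed case in Postgres) are the
// ones the generated SELECT currently leaves unescaped.
val sqlContext = new SQLContext(sc)
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
  "dbtable" -> "my_schema.my_table"))
df.printSchema()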

Re: Re: rdd.cache() not working ?

2015-04-01 Thread fightf...@163.com
Hi, That is just the issue. After running person.cache we then run person.count; however, there still isn't any cache shown under the web UI Storage tab. Thanks, Sun. fightf...@163.com From: Taotao.Li Date: 2015-04-01 14:02 To: fightfate CC: user Subject: Re: rdd.cache() not working ?

Re: Using 'fair' scheduler mode

2015-04-01 Thread Sean Owen
Does the expensive query take all executor slots? Then there is nothing for any other job to use regardless of scheduling policy. On Mar 31, 2015 9:20 PM, asadrao as...@microsoft.com wrote: Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the first query is a very

Re: Re: rdd.cache() not working ?

2015-04-01 Thread fightf...@163.com
Hi all, Thanks a lot for capturing this. We are now using the 1.3.0 release. We tested with both the prebuilt version of Spark and a version compiled from source targeting our CDH components, and the cache result did not show up as expected. However, if we create a dataframe with the person rdd and using

Re: Spark + Kafka

2015-04-01 Thread James King
Thank you bit1129. From looking at the web UI I can see 2 cores. Also looking at http://spark.apache.org/docs/1.2.1/configuration.html but I can't see an obvious configuration for the number of receivers; can you help please? On Wed, Apr 1, 2015 at 9:39 AM, bit1...@163.com bit1...@163.com wrote:

Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :) On Wed, Apr 1, 2015 at 00:08 Felix Cheung felixcheun...@hotmail.com wrote: This is tracked by these JIRAs.. https://issues.apache.org/jira/browse/SPARK-5947 https://issues.apache.org/jira/browse/SPARK-5948 -- From: denny.g@gmail.com Date:

Re: Spark Streaming and JMS

2015-04-01 Thread danila
Hi Tathagata, do you know if a JMS Receiver was introduced during the last year as a standard Spark component, or is somebody developing it? Regards Danila -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p22337.html Sent from the

Re: Spark + Kafka

2015-04-01 Thread James King
This is the code. And I couldn't find anything like the log you suggested. public KafkaLogConsumer(int duration, String master) { JavaStreamingContext spark = createSparkContext(duration, master); Map<String, Integer> topics = new HashMap<String, Integer>(); topics.put("test", 1);

Re: Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread Ted Yu
+1 on escaping column names. On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote: Question: --- Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it sees the names correctly in the schema created but does not

Re: Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
Thank you Sean that does the trick. On Wed, Apr 1, 2015 at 12:05 PM, Sean Owen so...@cloudera.com wrote: You can disable with spark.ui.showConsoleProgress=false but I also wasn't sure why this writes straight to System.err, at first. I assume it's because it's writing carriage returns to

Spark throws rsync: change_dir errors on startup

2015-04-01 Thread Horsmann, Tobias
Hi, I try to set up a minimal 2-node spark cluster for testing purposes. When I start the cluster with start-all.sh I get a rsync error message: rsync: change_dir /usr/local/spark130/sbin//right failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see

RE: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Shuai Zheng
Hi Akhil, Thanks a lot! After setting export _JAVA_OPTIONS=-Xmx5g, the OutOfMemory exception disappeared. But this leaves me confused: the --driver-memory option doesn't work for spark-submit to YARN (I haven't checked other clusters), is it a bug? Regards, Shuai From: Akhil Das

Re: Spark 1.3 build with hive support fails on JLine

2015-04-01 Thread Ted Yu
Please invoke dev/change-version-to-2.11.sh before running mvn. Cheers On Mon, Mar 30, 2015 at 1:02 AM, Night Wolf nightwolf...@gmail.com wrote: Hey, Trying to build Spark 1.3 with Scala 2.11 supporting yarn hive (with thrift server). Running; *mvn -e -DskipTests -Pscala-2.11

Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Vinoth Chandar
Hi all, As I understand from docs and talks, the streaming state is in memory as RDD (optionally checkpointable to disk). SPARK-2629 hints that this in memory structure is not indexed efficiently? I am wondering how my performance would be if the streaming state does not fit in memory (say 100GB
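A minimal Scala sketch of the kind of keyed state the question is about (hostname/port and checkpoint directory are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// The keyed state produced by updateStateByKey lives in memory as an RDD
// carried forward from batch to batch (and periodically checkpointed),
// which is why very large state is a concern.
val conf = new SparkConf().setAppName("state-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoints")

val counts = ssc.socketTextStream("localhost", 9999)
  .map(word => (word, 1))
  .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0))
  }

counts.print()
ssc.start()
ssc.awaitTermination()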

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-04-01 Thread Haoyuan Li
Response inline. On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun sean.bigdata...@gmail.com wrote: (resending...) I was thinking the same setup… But the more I think of this problem, and the more interesting this could be. If we allocate 50% total memory to Tachyon statically, then the

Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-01 Thread chutium
Hi, we always get issues when inserting or creating a table with the Amazon EMR Spark version: when inserting a result set of about 1GB, the Spark SQL query never finishes; when inserting a small result set (like 500MB), it works fine. *spark.sql.shuffle.partitions* by default 200 or *set

Re: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Sean Owen
I feel like I recognize that problem, and it's almost the inverse of https://issues.apache.org/jira/browse/SPARK-3884 which I was looking at today. The spark-class script didn't seem to handle all the ways that driver memory can be set. I think this is also something fixed by the new launcher

Error reading smallint in hive table with parquet format

2015-04-01 Thread Masf
Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1 field1 is SMALLINT. If table1 is in text format all is OK, but if table1 is in parquet format, Spark returns the following error:

Spark 1.3.0 missing dependency?

2015-04-01 Thread ARose
Upon executing these two lines of code: conf = new SparkConf().setAppName(appName).setMaster(master); sc = new JavaSparkContext(conf); I get the following error message: ERROR Configuration: Failed to set setXIncludeAware(true) for parser

Re: Spark 1.3.0 missing dependency?

2015-04-01 Thread Sean Owen
No, it means you have a conflicting version of Xerces somewhere in your classpath. Maybe your app is bundling an old version? On Wed, Apr 1, 2015 at 4:46 PM, ARose ashley.r...@telarix.com wrote: Upon executing these two lines of code: conf = new

Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Neelesh
With receivers, it was pretty obvious which code ran where - each receiver occupied a core and ran on the workers. However, with the new Kafka direct input streams, it's hard for me to understand where the code that's reading from the Kafka brokers runs. Does it run on the driver (I hope not), or does

Use with Data justifying Spark

2015-04-01 Thread Vila, Didier
Good Morning All, I would like to use Spark in a special synthetic case that justifies the use of Spark. So, I am looking for a case based on data (which can be big) and, ideally, the associated Java and/or Python code. It would be fantastic if you could refer me to a link where I can load this

pyspark hbase range scan

2015-04-01 Thread Eric Kimbrel
I am attempting to read an hbase table in pyspark with a range scan. conf = { "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table, "hbase.mapreduce.scan": scan } hbase_rdd = sc.newAPIHadoopRDD( "org.apache.hadoop.hbase.mapreduce.TableInputFormat",

SparkR newHadoopAPIRDD

2015-04-01 Thread Corey Nolet
How hard would it be to expose this in some way? I ask because the current textFile and objectFile functions are obviously at some point calling out to a FileInputFormat and configuring it. Could we get a way to configure any arbitrary inputformat / outputformat?

Data locality across jobs

2015-04-01 Thread kjsingh
Hi, We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of Tuple2. At the end of day, a daily job is launched, which works on the outputs of the hourly jobs. For data locality and speed, we wish that when the daily job launches, it finds all instances of a given key at a single
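One possible approach, sketched in Scala with placeholder paths and partition counts (not the list's conclusion): give every hourly output the same HashPartitioner, so a given key always lands in the same part-file index and the daily job can line up matching part files instead of reshuffling the whole day.

import org.apache.spark.HashPartitioner

val numPartitions = 128

// Hourly job: reduce with a fixed partitioner before saving, so key -> part
// file assignment is stable across hours.
val hourly = sc.textFile("hdfs:///events/2015-04-01-10")
  .map(line => (line.split("\t")(0), 1L))
  .reduceByKey(new HashPartitioner(numPartitions), _ + _)

hourly.saveAsObjectFile("hdfs:///hourly-output/2015-04-01-10")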

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh: export

RE: Streaming anomaly detection using ARIMA

2015-04-01 Thread Felix Cheung
I'm curious - I'm not sure if I understand you correctly. With SparkR, the work is distributed in Spark and computed in R, isn't that what your are looking for? SparkR was on rJava for the R-JVM but moved away from it. rJava has a component called JRI which allows JVM to call R. You could call

Re: Spark SQL saveAsParquet failed after a few waves

2015-04-01 Thread Yijie Shen
I have 7 workers for spark and set SPARK_WORKER_CORES=12, therefore 84 tasks in one job can run simultaneously, I call the tasks in a job started almost  simultaneously a wave. While inserting, there is only one job on spark, not inserting from multiple programs concurrently. —  Best Regards!

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-01 Thread twinkle sachdeva
Hi, Thanks Sandy. Another way to look at this is: would we like our long running application to die? So let's say we create a window of around 10 batches, and we are using incremental kinds of operations inside our application; as a restart here is relatively more costly, so

Re: Spark, snappy and HDFS

2015-04-01 Thread Xianjin YE
Can you read snappy compressed file in hdfs? Looks like the libsnappy.so is not in the hadoop native lib path. On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS?

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; d...@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can.

Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792 There is also an attempt at adding stddev here already: https://github.com/apache/spark/pull/5228 On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang hw...@qilinsoft.com
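A minimal Scala sketch of built-in aggregates through the Spark 1.3 DataFrame API (the table and column names are placeholders):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sqlContext = new SQLContext(sc)
val df = sqlContext.table("employees")  // placeholder table

// Aggregate expressions usable via groupBy().agg(...)
val summary = df.groupBy("department")
  .agg(avg("salary"), max("salary"), count("salary"))
summary.show()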

Re: Quick GraphX gutcheck

2015-04-01 Thread Takeshi Yamamuro
hi, Yes, you're right. Original VertexIds are used to join them in VertexRDD#xxxJoin. On Thu, Apr 2, 2015 at 7:31 AM, hokiegeek2 soozandjohny...@gmail.com wrote: Hi Everyone, Quick (hopefully) and silly (likely) question--the VertexId can be used to join the VertexRDD generated from

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-04-01 Thread Xiangrui Meng
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642 and used the same lambda scaling as in 1.2. The change will be included in Spark 1.3.1, which will be released soon. Thanks for reporting this issue! -Xiangrui On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng men...@gmail.com

Re: Spark SQL does not read from cached table if table is renamed

2015-04-01 Thread Michael Armbrust
This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi all, Noticed a bug in my current version of Spark 1.2.1. After a table is cached with “cache table table” command, query will

Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
Surprised I haven't gotten any responses about this. Has anyone tried using rJava or FastR with Spark? I've seen the SparkR project but that goes the other way - what I'd like to do is use R for model calculation and Spark to distribute the load across the cluster. Also, has anyone used Scalation

Re: pyspark hbase range scan

2015-04-01 Thread Ted Yu
Have you looked at http://happybase.readthedocs.org/en/latest/ ? Cheers On Apr 1, 2015, at 4:50 PM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an hbase table in pyspark with a range scan. conf = { hbase.zookeeper.quorum: host,

Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? *java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z* The following works for me when the file is uncompressed: import

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Thanks for the super quick response! I can read the file just fine in Hadoop; it's just when I point Spark at this file it can't seem to read it due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing. On Wed, Apr 1, 2015 at 7:19

Re: Spark throws rsync: change_dir errors on startup

2015-04-01 Thread Akhil Das
Error 23 is defined as a partial transfer and might be caused by filesystem incompatibilities, such as different character sets or access control lists. In this case it could be caused by the double slashes (// at the end of sbin), You could try editing your sbin/spark-daemon.sh file, look for

Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Dibyendu Bhattacharya
Hi, Just to let you know, I have made some enhancements in the Low Level Reliable Receiver based Kafka Consumer ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ). The earlier version used as many Receiver tasks as the number of partitions of your Kafka topic. Now you can configure the desired

Re: Unable to run Spark application

2015-04-01 Thread Marcelo Vanzin
Try sbt assembly instead. On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan kvi...@vt.edu wrote: Why do I get Failed to find Spark assembly JAR. You need to build Spark before running this program. ? I downloaded spark-1.2.1.tgz from the downloads page and extracted it. When I do sbt

Re: HiveContext setConf seems not stable

2015-04-01 Thread Michael Armbrust
Can you open a JIRA please? On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote: Hi, I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem: snippet 1:

Re: SparkSQL - Caching RDDs

2015-04-01 Thread Michael Armbrust
What do you mean by permanently? If you start up the JDBC server and say CACHE TABLE it will stay cached as long as the server is running. CACHE TABLE is idempotent, so you could even just have that command in your BI tool's setup queries. On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam
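A minimal sketch of the same statements issued through a HiveContext/SQLContext (the table name is a placeholder; from a BI tool they would go over the Thrift/JDBC connection instead):

// Cached for the lifetime of the server/application process.
sqlContext.sql("CACHE TABLE my_hive_table")
sqlContext.sql("SELECT count(*) FROM my_hive_table").show()  // now served from the cache
sqlContext.sql("UNCACHE TABLE my_hive_table")                // when no longer needed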

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Neelesh
Thanks Cody, that was really helpful. I have a much better understanding now. One last question - Kafka topics are initialized once in the driver, is there an easy way of adding/removing topics on the fly? KafkaRDD#getPartitions() seems to be computed only once, and no way of refreshing them.

Re: How to specify the port for AM Actor ...

2015-04-01 Thread Manoj Samel
Filed https://issues.apache.org/jira/browse/SPARK-6653 On Sun, Mar 29, 2015 at 8:18 PM, Shixiong Zhu zsxw...@gmail.com wrote: LGTM. Could you open a JIRA and send a PR? Thanks. Best Regards, Shixiong Zhu 2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: I looked @ the 1.3.0

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Cody Koeninger
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md The kafka consumers run in the executors. On Wed, Apr 1, 2015 at 11:18 AM, Neelesh neele...@gmail.com wrote: With receivers, it was pretty obvious which code ran where - each receiver occupied a core and ran on the
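A minimal Scala sketch of the Spark 1.3 direct stream being described (broker list and topic are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// The driver only plans offset ranges per batch; the actual fetching from
// the Kafka brokers happens in the executor tasks.
val conf = new SparkConf().setAppName("direct-kafka-demo")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).print()
ssc.start()
ssc.awaitTermination()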

Re: Error reading smallint in hive table with parquet format

2015-04-01 Thread Michael Armbrust
Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version. On Wed, Apr 1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote: Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS

Re: Use with Data justifying Spark

2015-04-01 Thread Sonal Goyal
Maybe check the examples? http://spark.apache.org/examples.html Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Apr 1, 2015 at 8:31 PM, Vila, Didier didier.v...@teradata.com wrote: Good Morning All, I would like to use

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Cody Koeninger
If you want to change topics from batch to batch, you can always just create a KafkaRDD repeatedly. The streaming code as it stands assumes a consistent set of topics, though. The implementation is private so you can't subclass it without building your own Spark. On Wed, Apr 1, 2015 at 1:09 PM,

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-01 Thread Sandy Ryza
That's a good question, Twinkle. One solution could be to allow a maximum number of failures within any given time span. E.g. a max failures per hour property. -Sandy On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, In spark over YARN, there is a

Re: Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Neelesh
Hi Dibyendu, Thanks for your work on this project. Spark 1.3 now has direct kafka streams, but still does not provide enough control over partitions and topics. For example, the streams are fairly statically configured - RDD.getPartitions() is computed only once, thus making it difficult to use

Re: Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
That is failing too, with sbt.resolveexception: unresolved dependency:org.apache.spark#spark-network-common_2.10;1.2.1 On Wed, Apr 1, 2015 at 1:24 PM, Marcelo Vanzin van...@cloudera.com wrote: Try sbt assembly instead. On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan kvi...@vt.edu

HiveContext setConf seems not stable

2015-04-01 Thread Hao Ren
Hi, I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem: snippet 1: import org.apache.spark.sql.hive.HiveContext import

Re: Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread Michael Armbrust
Can you open a JIRA for this please? On Wed, Apr 1, 2015 at 6:14 AM, Ted Yu yuzhih...@gmail.com wrote: +1 on escaping column names. On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote: Question: --- Is there a way to have JDBC DataFrames use quoted/escaped

RE: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Shuai Zheng
Nice. But my case shows that even when I use yarn-client, I have the same issue. I did verify it several times. And I am running 1.3.0 on EMR (using the version dispatched by the installSpark script from AWS). I agree _JAVA_OPTIONS is not a right solution, but I will use it until 1.4.0 is out :) Regards,

SparkSQL - Caching RDDs

2015-04-01 Thread Venkat, Ankam
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool. Is there a way to cache the Hive table permanently in SparkSQL? I don't want to read the Hive table and cache it every time a query is submitted from the BI tool. Thanks!

Re: Broadcasting a parquet file using spark and python

2015-04-01 Thread Michael Armbrust
You will need to create a hive parquet table that points to the data and run ANALYZE TABLE tableName noscan so that we have statistics on the size. On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra jitesh...@gmail.com wrote: Hi Michael, Thanks for your response. I am running 1.2.1. Is

persist(MEMORY_ONLY) takes lot of time

2015-04-01 Thread SamyaMaiti
Hi Experts, I have a parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL queries repetitively. A few questions: 1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory; any suggestions on how to tune this? val inputP
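A minimal Scala sketch of the setup being described (the HDFS path and table name are placeholders):

import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val sqlContext = new SQLContext(sc)

// Read the parquet dataset once, persist it, and force materialization with
// an action so later SQL queries hit memory instead of HDFS.
val inputP = sqlContext.parquetFile("hdfs:///data/events.parquet")
inputP.persist(StorageLevel.MEMORY_ONLY)
inputP.registerTempTable("events")
inputP.count()   // triggers the (potentially slow) initial scan and caching

sqlContext.sql("SELECT count(*) FROM events").show()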

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Cody Koeninger
As I said in the original ticket, I think the implementation classes should be exposed so that people can subclass and override compute() to suit their needs. Just adding a function from Time => Set[TopicAndPartition] wouldn't be sufficient for some of my current production use cases. compute()

RE: Use with Data justifying Spark

2015-04-01 Thread Vila, Didier
Sonal, Thanks for the link (I have worked with them already). Do you have any other, more "exotic" examples where I can play with specific data and algorithms? Cheers, Didier From: Sonal Goyal [mailto:sonalgoy...@gmail.com] Sent: Wednesday, April 01, 2015 7:09 PM To: Vila, Didier Cc:

Re: Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Vinoth Chandar
Thanks for confirming! On Wed, Apr 1, 2015 at 12:33 PM, Tathagata Das t...@databricks.com wrote: In the current state yes there will be performance issues. It can be done much more efficiently and we are working on doing that. TD On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar

Spark on EC2

2015-04-01 Thread Vadim Bichutskiy
Hi all, I just tried launching a Spark cluster on EC2 as described in http://spark.apache.org/docs/1.3.0/ec2-scripts.html I got the following response: <Response><Errors><Error><Code>PendingVerification</Code><Message>Your account is currently being verified. Verification normally takes less than 2

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Neelesh
Thanks Cody! On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger c...@koeninger.org wrote: If you want to change topics from batch to batch, you can always just create a KafkaRDD repeatedly. The streaming code as it stands assumes a consistent set of topics though. The implementation is

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Tathagata Das
We should be able to support that use case in the direct API. It may be as simple as allowing the users to pass in a function that returns the set of topic+partitions to read from, that is, a function (Time) => Set[TopicAndPartition]. This gets called every batch interval before the offsets are

Re: Spark Streaming and JMS

2015-04-01 Thread Tathagata Das
It's not a built-in component of Spark. However, there is a spark-package for an Apache Camel receiver which can integrate with JMS. http://spark-packages.org/package/synsys/spark I have not tried it but do check it out. TD On Wed, Apr 1, 2015 at 4:38 AM, danila danila.erma...@gmail.com wrote: Hi

Re: Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Tathagata Das
In the current state yes there will be performance issues. It can be done much more efficiently and we are working on doing that. TD On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar vin...@uber.com wrote: Hi all, As I understand from docs and talks, the streaming state is in memory as RDD

Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
Why do I get Failed to find Spark assembly JAR. You need to build Spark before running this program. ? I downloaded spark-1.2.1.tgz from the downloads page and extracted it. When I do sbt package inside my application, it worked fine. But when I try to run my application, I get the above

Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
Hey Chris, Apologies for the delayed reply. Your responses are always insightful and appreciated :-) However, I have a few more questions. also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches I assume you mean dstream.saveAsTextFiles()

Re: Spark SQL saveAsParquet failed after a few waves

2015-04-01 Thread Michael Armbrust
When few waves (1 or 2) are used in a job, LoadApp could finish after a few failures and retries. But when more waves (3) are involved in a job, the job would terminate abnormally. Can you clarify what you mean by waves? Are you inserting from multiple programs concurrently?

Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-01 Thread ARose
Note: I am running Spark on Windows 7 in standalone mode. In my app, I run the following: DataFrame df = sqlContext.sql("SELECT * FROM tbBER"); System.out.println("Count: " + df.count()); tbBER is registered as a temp table in my SQLContext. When I try to print the number of rows in

Re: Spark permission denied error when invoking saveAsTextFile

2015-04-01 Thread Kannan Rajah
Ignore the question. There was a Hadoop setting that needed to be set to get it working. -- Kannan On Wed, Apr 1, 2015 at 1:37 PM, Kannan Rajah kra...@maprtech.com wrote: Running a simple word count job in standalone mode as a non root user from spark-shell. The spark master, worker services

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-01 Thread Dean Wampler
Is it possible tbBER is empty? If so, it shouldn't fail like this, of course. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler

Re: Spark-EC2 Security Group Error

2015-04-01 Thread Daniil Osipov
Appears to be a problem with boto. Make sure you have boto 2.34 on your system. On Wed, Apr 1, 2015 at 11:19 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – I’m trying to bring up a spark ec2 cluster with the script below and see the following error. Can anyone please advise as

Re: How to start master and workers on Windows

2015-04-01 Thread ARose
I'm in the same boat. What are the equivalent commands to stop the master and workers? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-tp12669p22341.html Sent from the Apache Spark User List mailing list archive at

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-01 Thread Tathagata Das
The challenge of opening up these internal classes to public (even with Developer API tag) is that it prevents us from making non-trivial changes without breaking API compatibility for all those who had subclassed. Its a tradeoff that is hard to optimize. That's why we favor exposing more optional

Re: Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
Managed to make sbt assembly work. I run into another issue now. When I do ./sbin/start-all.sh, the script fails saying JAVA_HOME is not set although I have explicitly set that variable to point to the correct Java location. Same happens with ./sbin/start-master.sh script. Any idea what I might

  1   2   >