StackOverflow Problem with 1.3 mllib ALS

2015-04-01 Thread Justin Yip
Hello, I have been using MLlib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and encountered a stack overflow problem. After some digging, I realized that when the number of iterations is greater than ~35, I get the overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there

Starting httpd: http: Syntax error on line 154

2015-04-01 Thread Ganon Pierce
I’m unable to access Ganglia; it looks like the web server is not starting, as I receive this error when I launch Spark: Starting httpd: http: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so This occurs when I’m using the vanilla script.

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-01 Thread Akhil Das
Try adding all the jars in your $HIVE/lib directory. If you want the specific jar, you could look for jackson or the json serde in it. Thanks Best Regards On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist wrote: > I have a feeling I’m missing a Jar that provides the support or could this > may be related t

Re: Spark throws rsync: change_dir errors on startup

2015-04-01 Thread Akhil Das
Error 23 is defined as a "partial transfer" and might be caused by filesystem incompatibilities, such as different character sets or access control lists. In this case it could be caused by the double slashes (// at the end of sbin). You could try editing your sbin/spark-daemon.sh file; look for rs

RE: Streaming anomaly detection using ARIMA

2015-04-01 Thread Felix Cheung
I'm curious - I'm not sure if I understand you correctly. With SparkR, the work is distributed in Spark and computed in R; isn't that what you are looking for? SparkR was built on rJava for the R-JVM bridge but moved away from it. rJava has a component called JRI which allows the JVM to call R. You could call R

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; d...@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark/

Data locality across jobs

2015-04-01 Thread kjsingh
Hi, We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of Tuple2. At the end of the day, a daily job is launched, which works on the outputs of the hourly jobs. For data locality and speed, we wish that when the daily job launches, it finds all instances of a given key at a single

Incorrect results of spark sql on hive select

2015-04-01 Thread LinQili
Hi all: I was using Spark SQL on Hive for doing some statistics. While using Spark SQL, I found that, sometimes, spark-sql got incorrect results on a select. And if I executed the same select statement again, it returned different results, which were correct. That is to say, sometimes, with

Re: Quick GraphX gutcheck

2015-04-01 Thread Takeshi Yamamuro
hi, Yes, you're right. Original VertexIds are used to join them in VertexRDD#xxxJoin. On Thu, Apr 2, 2015 at 7:31 AM, hokiegeek2 wrote: > Hi Everyone, > > Quick (hopefully) and silly (likely) question--the VertexId can be used to > join the VertexRDD generated from Graph.vertices with a transf

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh: export JAVA_LI
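For anyone hitting the same UnsatisfiedLinkError, a minimal sketch of pointing the driver and executors at the Hadoop native libraries through SparkConf (the CDH path below is an assumption; use wherever libsnappy.so and libhadoop.so actually live on your nodes):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative path; verify where libsnappy.so / libhadoop.so are installed.
    val nativeLibs = "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native"

    val conf = new SparkConf()
      .setAppName("snappy-sequencefile-read")
      // Make the native Hadoop/Snappy libraries visible to the driver and every executor.
      .set("spark.driver.extraLibraryPath", nativeLibs)
      .set("spark.executor.extraLibraryPath", nativeLibs)

    val sc = new SparkContext(conf)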

Re: Spark, snappy and HDFS

2015-04-01 Thread Xianjin YE
Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether the libsnappy.so is in the hadoop native lib path. On Thursday, April 2, 2015 at 10:22 AM, Nic

Re: How to set up a Spark Cluster?

2015-04-01 Thread amghost
First of all, you should download a Spark version pre-built for Hadoop, according to the version of Hadoop you are using. Next, you may find useful info in the Spark docs: http://spark.apache.org/docs/latest/running-on-yarn.html There is some content about deploying in cluster mode in Spark's pr

Re: Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Thanks for the super quick response! I can read the file just fine in hadoop; it's just that when I point Spark at this file it can't seem to read it due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing. On Wed, Apr 1, 2015 at 7:19 P

Re: Spark, snappy and HDFS

2015-04-01 Thread Xianjin YE
Can you read the snappy-compressed file in HDFS? Looks like the libsnappy.so is not in the hadoop native lib path. On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: > Has anyone else encountered the following error when trying to read a snappy > compressed sequence file from HDFS? > > *j

Spark, snappy and HDFS

2015-04-01 Thread Nick Travers
Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? *java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z* The following works for me when the file is uncompressed: import org.apache.hadoop.io.

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-01 Thread twinkle sachdeva
Hi, Thanks Sandy. Another way to look at this is: would we like our long running application to die? So let's say we create a window of around 10 batches, and we are using incremental kinds of operations inside our application; as a restart here is relatively more costly, so shoul

Re: Spark SQL saveAsParquet failed after a few waves

2015-04-01 Thread Yijie Shen
I have 7 workers for spark and set SPARK_WORKER_CORES=12, therefore 84 tasks in one job can run simultaneously. I call the tasks in a job that start almost simultaneously a wave. While inserting, there is only one job on spark, not inserting from multiple programs concurrently. — Best Regards! Y

SparkR newHadoopAPIRDD

2015-04-01 Thread Corey Nolet
How hard would it be to expose this in some way? I ask because the current textFile and objectFile functions are obviously at some point calling out to a FileInputFormat and configuring it. Could we get a way to configure any arbitrary inputformat / outputformat?

Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792 There is also an attempt at adding stddev here already: https://github.com/apache/spark/pull/5228 On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang wrote: > Specifically
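For reference, a minimal sketch of calling the built-in aggregate functions through DataFrame.groupBy in Spark 1.3 (the DataFrame and column names are made up):

    import org.apache.spark.sql.functions._  // avg, max, sum, ...

    // df is assumed to be an existing DataFrame with department, salary and age columns.
    val summary = df.groupBy("department")
      .agg(avg("salary"), max("age"))

    summary.show()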

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-04-01 Thread Xiangrui Meng
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642 and used the same lambda scaling as in 1.2. The change will be included in Spark 1.3.1, which will be released soon. Thanks for reporting this issue! -Xiangrui On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng wrote: > I created a

Re: pyspark hbase range scan

2015-04-01 Thread Ted Yu
Have you looked at http://happybase.readthedocs.org/en/latest/ ? Cheers > On Apr 1, 2015, at 4:50 PM, Eric Kimbrel > wrote: > > I am attempting to read an hbase table in pyspark with a range scan. > > conf = { >"hbase.zookeeper.quorum": host, >"hbase.mapreduce.inputtable": table,

pyspark hbase range scan

2015-04-01 Thread Eric Kimbrel
I am attempting to read an hbase table in pyspark with a range scan. conf = { "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table, "hbase.mapreduce.scan" : scan } hbase_rdd = sc.newAPIHadoopRDD( "org.apache.hadoop.hbase.mapreduce.TableInputFormat", "o

Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
Surprised I haven't gotten any responses about this. Has anyone tried using rJava or FastR with Spark? I've seen the SparkR project but that goes the other way - what I'd like to do is use R for model calculation and Spark to distribute the load across the cluster. Also, has anyone used Scalation for

Re: Spark SQL does not read from cached table if table is renamed

2015-04-01 Thread Michael Armbrust
This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash wrote: > Hi all, > > > > Noticed a bug in my current version of Spark 1.2.1. > > > > After a table is cached with “cache table ” command, query will not > read from memory if S

Spark SQL does not read from cached table if table is renamed

2015-04-01 Thread Judy Nash
Hi all, Noticed a bug in my current version of Spark 1.2.1. After a table is cached with the "cache table " command, queries will not read from memory if the SQL query renames the table. This query reads from the in-memory table, i.e. select hivesampletable.country from default.hivesampletable group by hive

Quick GraphX gutcheck

2015-04-01 Thread hokiegeek2
Hi Everyone, Quick (hopefully) and silly (likely) question--the VertexId can be used to join the VertexRDD generated from Graph.vertices with a transformed RDD where the keys are vertexIds from the original graph, correct? --John -- View this message in context: http://apache-spark-user-list
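A minimal sketch of the join being asked about, assuming the transformed RDD is keyed by the original VertexId (the graph and attribute types are illustrative):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // graph: Graph[String, Double] built earlier; labels: an RDD keyed by the original vertex ids.
    val labels: RDD[(VertexId, String)] = graph.vertices.map { case (id, attr) => (id, attr.toUpperCase) }

    // VertexRDD#innerJoin matches on VertexId, so the original ids line up as expected.
    val joined: VertexRDD[(String, String)] =
      graph.vertices.innerJoin(labels) { (id, attr, label) => (attr, label) }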

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-01 Thread Dean Wampler
Is it possible "tbBER" is empty? If so, it shouldn't fail like this, of course. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogr

Re: Spark permission denied error when invoking saveAsTextFile

2015-04-01 Thread Kannan Rajah
Ignore the question. There was a Hadoop setting that needed to be set to get it working. -- Kannan On Wed, Apr 1, 2015 at 1:37 PM, Kannan Rajah wrote: > Running a simple word count job in standalone mode as a non root user from > spark-shell. The spark master, worker services are running as ro

Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-01 Thread ARose
Note: I am running Spark on Windows 7 in standalone mode. In my app, I run the following: DataFrame df = sqlContext.sql("SELECT * FROM tbBER"); System.out.println("Count: " + df.count()); tbBER is registered as a temp table in my SQLContext. When I try to print the number of rows

Re: Spark-EC2 Security Group Error

2015-04-01 Thread Daniil Osipov
Appears to be a problem with boto. Make sure you have boto 2.34 on your system. On Wed, Apr 1, 2015 at 11:19 AM, Ganelin, Ilya wrote: > Hi all – I’m trying to bring up a spark ec2 cluster with the script below > and see the following error. Can anyone please advise as to what’s going > on? Is th

Re: Spark on EC2

2015-04-01 Thread Daniil Osipov
You're probably requesting more instances than allowed by your account, so the error gets generated for the extra instances. Try launching a smaller cluster. On Wed, Apr 1, 2015 at 12:41 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > Hi all, > > I just tried launching a Spark cluster

RE: --driver-memory parameter doesn't work for spark-submit on yarn?

2015-04-01 Thread Bozeman, Christopher
Shuai, What did " ps aux | grep spark-submit" reveal? When you compare using _JAVA_OPTIONS and without using it, where do you see the difference? Thanks Christopher -Original Message- From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, April 01, 2015 11:12 AM To: 'Sea

RE: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-01 Thread Bozeman, Christopher
Teng, There is no need to alter hive.metastore.warehouse.dir. Leave it as is and just create external tables with location pointing to S3. What I suspect you are seeing is that spark-sql is writing to a temp directory within S3 then issuing a rename to the final location as would be done wi

Spark permission denied error when invoking saveAsTextFile

2015-04-01 Thread Kannan Rajah
Running a simple word count job in standalone mode as a non-root user from spark-shell. The spark master and worker services are running as the root user. The problem is that the _temporary dir under /user/krajah/output2/_temporary/0 is being created with root permissions even when running the job as a non-root

Re: Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
Managed to make "sbt assembly" work. I run into another issue now. When I do "./sbin/start-all.sh", the script fails saying JAVA_HOME is not set although I have explicitly set that variable to point to the correct Java location. Same happens with "./sbin/start-master.sh" script. Any idea what I mi

Re: Unable to run Spark application

2015-04-01 Thread Isaque Alves
You could try to run the 'make-distribution.sh' script with the proper options for your case. The script is in the extracted spark-1.2.1 folder.

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Tathagata Das
The challenge of opening up these internal classes to the public (even with the Developer API tag) is that it prevents us from making non-trivial changes without breaking API compatibility for all those who had subclassed. It's a tradeoff that is hard to optimize. That's why we favor exposing more optional

Spark on EC2

2015-04-01 Thread Vadim Bichutskiy
Hi all, I just tried launching a Spark cluster on EC2 as described in http://spark.apache.org/docs/1.3.0/ec2-scripts.html I got the following response: *"PendingVerificationYour account is currently being verified. Verification normally takes less than 2 hours. Until your account is verified,

Re: Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Vinoth Chandar
Thanks for confirming! On Wed, Apr 1, 2015 at 12:33 PM, Tathagata Das wrote: > In the current state yes there will be performance issues. It can be done > much more efficiently and we are working on doing that. > > TD > > On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar wrote: > >> Hi all, >> >>

RE: Use with Data justifying Spark

2015-04-01 Thread Vila, Didier
Sonal, Thanks for the link (I worked with them yet). Do you have any other, more “exotic” examples where I can play with specific data and algorithms? Cheers, Didier From: Sonal Goyal [mailto:sonalgoy...@gmail.com] Sent: Wednesday, April 01, 2015 7:09 PM To: Vila, Didier Cc: user@spark.apache.

Re: Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Tathagata Das
In the current state yes there will be performance issues. It can be done much more efficiently and we are working on doing that. TD On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar wrote: > Hi all, > > As I understand from docs and talks, the streaming state is in memory as > RDD (optionally che

persist(MEMORY_ONLY) takes lot of time

2015-04-01 Thread SamyaMaiti
Hi Experts, I have a parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL queries repetitively. A few questions: 1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory; any suggestions on how to tune this? val inputP
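A minimal sketch of the pattern in question (paths and table names are placeholders; sqlContext is assumed to exist). One option sometimes worth comparing is letting Spark SQL cache the table in its columnar format instead of a plain persist:

    import org.apache.spark.storage.StorageLevel

    // Spark 1.3: read the Parquet data and keep it in memory for repeated SQL queries.
    val inputP = sqlContext.parquetFile("hdfs:///data/my_dataset.parquet")
    inputP.persist(StorageLevel.MEMORY_ONLY)
    inputP.registerTempTable("my_table")

    // Alternative: cache through Spark SQL's in-memory columnar store.
    // sqlContext.cacheTable("my_table")

    sqlContext.sql("SELECT count(*) FROM my_table").show()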

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
+1 for subclassing. It's more flexible if we can subclass the implementation classes. On Apr 1, 2015 12:19 PM, "Cody Koeninger" wrote: > As I said in the original ticket, I think the implementation classes > should be exposed so that people can subclass and override compute() to > suit their nee

Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-01 Thread Todd Nist
I have a feeling I’m missing a Jar that provides the support or could this may be related to https://issues.apache.org/jira/browse/SPARK-5792. If it is a Jar where would I find that ? I would have thought in the $HIVE/lib folder, but not sure which jar contains it. Error: Create Metric Temporary

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Cody Koeninger
As I said in the original ticket, I think the implementation classes should be exposed so that people can subclass and override compute() to suit their needs. Just adding a function from Time => Set[TopicAndPartition] wouldn't be sufficient for some of my current production use cases. compute() i

Re: Spark Streaming and JMS

2015-04-01 Thread Tathagata Das
It's not a built-in component of Spark. However, there is a spark-package for an Apache Camel receiver which can integrate with JMS. http://spark-packages.org/package/synsys/spark I have not tried it but do check it out. TD On Wed, Apr 1, 2015 at 4:38 AM, danila wrote: > Hi Tathagata > > do you kno

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Tathagata Das
We should be able to support that use case in the direct API. It may be as simple as allowing the users to pass in a function that returns the set of topic+partitions to read from. That is, a function (Time) => Set[TopicAndPartition]. This gets called every batch interval before the offsets are decided

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
Thanks Cody! On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger wrote: > If you want to change topics from batch to batch, you can always just > create a KafkaRDD repeatedly. > > The streaming code as it stands assumes a consistent set of topics > though. The implementation is private so you cant

Spark-EC2 Security Group Error

2015-04-01 Thread Ganelin, Ilya
Hi all – I’m trying to bring up a spark ec2 cluster with the script below and see the following error. Can anyone please advise as to what’s going on? Is this indicative of me being unable to connect in the first place? The keys are known to work (they’re used elsewhere). ./spark-ec2 -k $AWS_KE

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Cody Koeninger
If you want to change topics from batch to batch, you can always just create a KafkaRDD repeatedly. The streaming code as it stands assumes a consistent set of topics though. The implementation is private so you can't subclass it without building your own spark. On Wed, Apr 1, 2015 at 1:09 PM, Nee

RE: --driver-memory parameter doesn't work for spark-submit on yarn?

2015-04-01 Thread Shuai Zheng
Nice. But my case shows that even when I use yarn-client, I have the same issue. I did verify it several times. And I am running 1.3.0 on EMR (using the version dispatched by the installSpark script from AWS). I agree _JAVA_OPTIONS is not a right solution, but I will use it until 1.4.0 is out :) Regards, S

Re: How to specify the port for AM Actor ...

2015-04-01 Thread Manoj Samel
Filed https://issues.apache.org/jira/browse/SPARK-6653 On Sun, Mar 29, 2015 at 8:18 PM, Shixiong Zhu wrote: > LGTM. Could you open a JIRA and send a PR? Thanks. > > Best Regards, > Shixiong Zhu > > 2015-03-28 7:14 GMT+08:00 Manoj Samel : > >> I looked @ the 1.3.0 code and figured where this can

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
Thanks Cody, that was really helpful. I have a much better understanding now. One last question - Kafka topics are initialized once in the driver, is there an easy way of adding/removing topics on the fly? KafkaRDD#getPartitions() seems to be computed only once, and no way of refreshing them. T

Re: Error reading smallint in hive table with parquet format

2015-04-01 Thread Michael Armbrust
Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version. On Wed, Apr 1, 2015 at 7:53 AM, Masf wrote: > > Hi. > > In Spark SQL 1.2.0, with HiveContext, I'm executing the following > statement: > > CREATE TABLE testTable STORED AS PARQUET AS > SELECT > fi

Re: Use with Data justifying Spark

2015-04-01 Thread Sonal Goyal
Maybe check the examples? http://spark.apache.org/examples.html Best Regards, Sonal Founder, Nube Technologies On Wed, Apr 1, 2015 at 8:31 PM, Vila, Didier wrote: > Good Morning All, > > > > I would like to use Spark in a spec

Re: SparkSQL - Caching RDDs

2015-04-01 Thread Michael Armbrust
What do you mean by "permanently"? If you start up the JDBC server and say CACHE TABLE, it will stay cached as long as the server is running. CACHE TABLE is idempotent, so you could even just have that command in your BI tool's setup queries. On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam wrote:
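A minimal sketch of what that looks like when issued through a SQLContext rather than the JDBC server (the table name is illustrative):

    // Cache the table in memory; it stays cached for the lifetime of the context / server.
    sqlContext.sql("CACHE TABLE hivesampletable")

    // Subsequent queries read from the in-memory columnar cache.
    sqlContext.sql("SELECT country, count(*) FROM hivesampletable GROUP BY country").show()

    // Release the memory when done.
    sqlContext.sql("UNCACHE TABLE hivesampletable")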

Re: Spark SQL saveAsParquet failed after a few waves

2015-04-01 Thread Michael Armbrust
> > When few waves (1 or 2) are used in a job, LoadApp could finish after a > few failures and retries. > But when more waves (3) are involved in a job, the job would terminate > abnormally. > Can you clarify what you mean by "waves"? Are you inserting from multiple programs concurrently?

Re: Broadcasting a parquet file using spark and python

2015-04-01 Thread Michael Armbrust
You will need to create a hive parquet table that points to the data and run "ANALYZE TABLE tableName noscan" so that we have statistics on the size. On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra wrote: > Hi Michael, > > Thanks for your response. I am running 1.2.1. > > Is there any wor
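A sketch of what that workflow might look like from a HiveContext (the table definition is illustrative; note that the full Hive statement is ANALYZE TABLE ... COMPUTE STATISTICS noscan):

    // Point a Hive Parquet table at the existing data, then collect size statistics
    // so the planner knows the table is small enough to broadcast.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS small_dim (id INT, name STRING)
      STORED AS PARQUET
      LOCATION '/data/small_dim'
    """)
    hiveContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")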

SparkSQL - Caching RDDs

2015-04-01 Thread Venkat, Ankam
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool. Is there a way to cache the Hive table permanently in SparkSQL? I don't want to read the Hive table and cache it every time the query is submitted from the BI tool. Thanks! R

Re: HiveContext setConf seems not stable

2015-04-01 Thread Michael Armbrust
Can you open a JIRA please? On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren wrote: > Hi, > > I find HiveContext.setConf does not work correctly. Here are some code > snippets showing the problem: > > snippet 1: > > -

Re: Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread Michael Armbrust
Can you open a JIRA for this please? On Wed, Apr 1, 2015 at 6:14 AM, Ted Yu wrote: > +1 on escaping column names. > > > > > On Apr 1, 2015, at 5:50 AM, fergjo00 wrote: > > > > Question: > > --- > > Is there a way to have JDBC DataFrames use quoted/escaped column names? > > Right now, it

Re: Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Dibyendu Bhattacharya
Thanks Neelesh. I also have a plan to migrate the offset management to topic-based (Kafka 0.8.2) rather than ZK-based. That will make the consumer much faster. If you have a plan to contribute your work to this consumer, that will be great as well. Dibyendu On Wed, Apr 1, 2015 at 11:07 PM, Neelesh

Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
Hey Chris, Apologies for the delayed reply. Your responses are always insightful and appreciated :-) However, I have a few more questions. "also, it looks like you're writing to S3 per RDD. you'll want to broaden that out to write DStream batches" I assume you mean "dstream.saveAsTextFiles(...

Re: Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
That is failing too, with "sbt.resolveexception: unresolved dependency:org.apache.spark#spark-network-common_2.10;1.2.1" On Wed, Apr 1, 2015 at 1:24 PM, Marcelo Vanzin wrote: > Try "sbt assembly" instead. > > On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan > wrote: > > Why do I get > >

Re: Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Neelesh
Hi Dibyendu, Thanks for your work on this project. Spark 1.3 now has direct kafka streams, but still does not provide enough control over partitions and topics. For example, the streams are fairly statically configured - RDD.getPartitions() is computed only once, thus making it difficult to use

Re: Unable to run Spark application

2015-04-01 Thread Marcelo Vanzin
Try "sbt assembly" instead. On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan wrote: > Why do I get > "Failed to find Spark assembly JAR. > You need to build Spark before running this program." ? > > I downloaded "spark-1.2.1.tgz" from the downloads page and extracted it. > When I do "sbt pac

Re: How to start master and workers on Windows

2015-04-01 Thread ARose
I'm in the same boat. What are the equivalent commands to stop the master and workers? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-tp12669p22341.html Sent from the Apache Spark User List mailing list archive at

Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
Why do I get "Failed to find Spark assembly JAR. You need to build Spark before running this program." ? I downloaded "spark-1.2.1.tgz" from the downloads page and extracted it. When I do "sbt package" inside my application, it worked fine. But when I try to run my application, I get the above men

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Cody Koeninger
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md The kafka consumers run in the executors. On Wed, Apr 1, 2015 at 11:18 AM, Neelesh wrote: > With receivers, it was pretty obvious which code ran where - each receiver > occupied a core and ran on the workers. However, with
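For context, a bare-bones sketch of the 1.3 direct stream API being discussed (ssc, the broker list and the topic name are assumed/illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // The driver only decides offset ranges per batch; the actual Kafka reads happen in the executors.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")  // fixed for the lifetime of the stream, as discussed in this thread

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)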

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-01 Thread Sandy Ryza
That's a good question, Twinkle. One solution could be to allow a maximum number of failures within any given time span. E.g. a max failures per hour property. -Sandy On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva < twinkle.sachd...@gmail.com> wrote: > Hi, > > In spark over YARN, there is

Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Dibyendu Bhattacharya
Hi, Just to let you know, I have made some enhancements in the Low Level Reliable Receiver based Kafka Consumer ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ). The earlier version used as many receiver tasks as the number of partitions of your kafka topic. Now you can configure the desired

HiveContext setConf seems not stable

2015-04-01 Thread Hao Ren
Hi, I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem: snippet 1: import org.apache.spark.sql.hive.HiveContext import org.apache.spark
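For reference, a minimal sketch of the setConf usage being discussed (property values are examples only):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // Properties set here are expected to apply to subsequent queries on this context.
    hc.setConf("spark.sql.shuffle.partitions", "10")
    hc.setConf("hive.exec.dynamic.partition", "true")

    println(hc.getConf("spark.sql.shuffle.partitions"))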

Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-01 Thread chutium
Hi, we always get issues on inserting or creating a table with the Amazon EMR Spark version: when inserting about a 1GB result set, the spark sql query will never finish; when inserting a small result set (like 500MB), it works fine. *spark.sql.shuffle.partitions* by default 200 or *set spark.sql.shuffle.par

Spark Streaming 1.3 & Kafka Direct Streams

2015-04-01 Thread Neelesh
With receivers, it was pretty obvious which code ran where - each receiver occupied a core and ran on the workers. However, with the new kafka direct input streams, it's hard for me to understand where the code that's reading from kafka brokers runs. Does it run on the driver (I hope not), or does i

Re: Spark 1.3.0 missing dependency?

2015-04-01 Thread Sean Owen
No, it means you have a conflicting version of Xerces somewhere in your classpath. Maybe your app is bundling an old version? On Wed, Apr 1, 2015 at 4:46 PM, ARose wrote: > Upon executing these two lines of code: > > conf = new SparkConf().setAppName(appName).setMaster(master); >
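If the application build is indeed bundling an old Xerces, a hedged build.sbt sketch of excluding it (the dependency coordinates are placeholders for whichever library drags it in):

    // build.sbt: exclude a transitively bundled Xerces so the JDK/Spark parser is used instead.
    libraryDependencies += ("some.group" % "legacy-lib" % "1.0")
      .exclude("xerces", "xercesImpl")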

Spark 1.3.0 missing dependency?

2015-04-01 Thread ARose
Upon executing these two lines of code: conf = new SparkConf().setAppName(appName).setMaster(master); sc = new JavaSparkContext(conf); I get the following error message: ERROR Configuration: Failed to set setXIncludeAware(true) for parser org.apache.xerces.jaxp.DocumentBuilderFac

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-04-01 Thread Haoyuan Li
Response inline. On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun wrote: > (resending...) > > I was thinking the same setup… But the more I think of this problem, and > the more interesting this could be. > > If we allocate 50% total memory to Tachyon statically, then the Mesos > benefits of dy

[SparkSQL] Zip DataFrame with a RDD

2015-04-01 Thread Jaonary Rabarisoa
Hi all, Is it possible to zip an existing DataFrame with an RDD[T] such that the result is a new DataFrame with one more column than the first one, where the additional column corresponds to the RDD[T]? In other words, is it possible to zip 2 DataFrames? Cheers, Jaonary
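There is no zip on DataFrame itself in 1.3, but one possible sketch (assuming both sides have identical partitioning and per-partition ordering, which RDD.zip requires) is to zip the underlying row RDD and rebuild a DataFrame; the helper below is hypothetical:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Appends the values of `extra` as a new string column; relies on matching partitioning/order.
    def zipWithColumn(df: DataFrame, extra: RDD[String], name: String): DataFrame = {
      val rows = df.rdd.zip(extra).map { case (row, value) => Row.fromSeq(row.toSeq :+ value) }
      val schema = StructType(df.schema.fields :+ StructField(name, StringType, nullable = true))
      df.sqlContext.createDataFrame(rows, schema)
    }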

Re: Anatomy of RDD : Deep dive into RDD data structure

2015-04-01 Thread David Holiday
w00t - t/y for this! I'm currently doing a deep dive into the RDD memory footprint under various conditions so this is timely and helpful. DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com

Use with Data justifying Spark

2015-04-01 Thread Vila, Didier
Good Morning All, I would like to use Spark in a special "synthetic" case that justifies the use of Spark. So, I am looking for a case based on data (which can be big) and possibly the associated Java and/or Python code. It would be fantastic if you could refer me to a link where I can load this ca

Error reading smallint in hive table with parquet format

2015-04-01 Thread Masf
Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1 *field1 is SMALLINT. If table1 is in text format everything is OK, but if table1 is in parquet format, spark returns the following error*: 15/04/

Re: --driver-memory parameter doesn't work for spark-submit on yarn?

2015-04-01 Thread Sean Owen
I feel like I recognize that problem, and it's almost the inverse of https://issues.apache.org/jira/browse/SPARK-3884 which I was looking at today. The spark-class script didn't seem to handle all the ways that driver memory can be set. I think this is also something fixed by the new launcher libr

Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Vinoth Chandar
Hi all, As I understand from docs and talks, the streaming state is in memory as RDD (optionally checkpointable to disk). SPARK-2629 hints that this in memory structure is not indexed efficiently? I am wondering how my performance would be if the streaming state does not fit in memory (say 100GB
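For concreteness, a minimal sketch of the kind of state being discussed, a running count per key (the source DStream and checkpoint path are assumed):

    // Required for updateStateByKey.
    ssc.checkpoint("hdfs:///checkpoints/state-demo")

    // Merge each batch's new values into the running total; the whole state lives in memory as an RDD.
    def updateFunc(newValues: Seq[Int], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + newValues.sum)

    // pairs: DStream[(String, Int)] produced upstream.
    val counts = pairs.updateStateByKey(updateFunc)
    counts.print()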

RE: --driver-memory parameter doesn't work for spark-submit on yarn?

2015-04-01 Thread Shuai Zheng
Hi Akhil, Thanks a lot! After setting export _JAVA_OPTIONS="-Xmx5g", the OutOfMemory exception disappeared. But this makes me confused: so the driver-memory option doesn’t work for spark-submit to YARN (I haven’t checked other clusters), is it a bug? Regards, Shuai From: Akhil Da

Re: Spark 1.3 build with hive support fails on JLine

2015-04-01 Thread Ted Yu
Please invoke dev/change-version-to-2.11.sh before running mvn. Cheers On Mon, Mar 30, 2015 at 1:02 AM, Night Wolf wrote: > Hey, > > Trying to build Spark 1.3 with Scala 2.11 supporting yarn & hive (with > thrift server). > > Running; > > *mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pm

Spark throws rsync: change_dir errors on startup

2015-04-01 Thread Horsmann, Tobias
Hi, I try to set up a minimal 2-node spark cluster for testing purposes. When I start the cluster with start-all.sh I get a rsync error message: rsync: change_dir "/usr/local/spark130/sbin//right" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see prev

Re: Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
Thank you Sean that does the trick. On Wed, Apr 1, 2015 at 12:05 PM, Sean Owen wrote: > You can disable with spark.ui.showConsoleProgress=false but I also > wasn't sure why this writes straight to System.err, at first. I assume > it's because it's writing carriage returns to achieve the animati

Re: Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread Ted Yu
+1 on escaping column names. > On Apr 1, 2015, at 5:50 AM, fergjo00 wrote: > > Question: > --- > Is there a way to have JDBC DataFrames use quoted/escaped column names? > Right now, it looks like it "sees" the names correctly in the schema created > but does not escape them in the

Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread fergjo00
Question: --- Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it "sees" the names correctly in the schema created but does not escape them in the SQL it creates when they are not compliant: org.apache.spark.sql.jdbc.JDBCRDD private v

Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :) On Wed, Apr 1, 2015 at 00:08 Felix Cheung wrote: > This is tracked by these JIRAs.. > > https://issues.apache.org/jira/browse/SPARK-5947 > https://issues.apache.org/jira/browse/SPARK-5948 > > -- > From: denny.g@gmail.com > Date: Wed, 1 Apr 2015 04:

RE: Spark + Kafka

2015-04-01 Thread Shao, Saisai
OK, seems there’s nothing strange in your code. So maybe we need to narrow down the cause: would you please run the KafkaWordCount example in Spark to see if it is OK? If it is OK, then we should focus on your implementation; otherwise Kafka potentially has some problems. Thanks Jerry From: Ja

Re: Spark Streaming and JMS

2015-04-01 Thread danila
Hi Tathagata do you know if a JMS receiver was introduced during the last year as a standard Spark component, or is somebody developing it? Regards Danila -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p22337.html Sent from the Apach

Re: Spark + Kafka

2015-04-01 Thread James King
This is the code. And I couldn't find anything like the log you suggested. public KafkaLogConsumer(int duration, String master) { JavaStreamingContext spark = createSparkContext(duration, master); Map topics = new HashMap(); topics.put("test", 1); JavaPairDStream input = KafkaUtils.createStream

Re: Disable stage logging to stdout

2015-04-01 Thread Sean Owen
You can disable with spark.ui.showConsoleProgress=false but I also wasn't sure why this writes straight to System.err, at first. I assume it's because it's writing carriage returns to achieve the animation and this won't work via a logging framework. stderr is where log-like output goes, because st
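A minimal sketch of turning the progress bar off programmatically (it can equally go in spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("quiet-app")
      // Disable the in-place [Stage N:> ...] progress bar written to stderr.
      .set("spark.ui.showConsoleProgress", "false")

    val sc = new SparkContext(conf)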

Disable stage logging to stdout

2015-04-01 Thread Theodore Vasiloudis
Since switching to Spark 1.2.1 I'm seeing logging for the stage progress (ex.): [error] [Stage 2154:> (14 + 8) / 48][Stage 2210:> (0 + 0) / 48] Any reason why these are error level logs? Shouldn't they be info level? In any case is there a way to disable them other than

Re: Spark + Kafka

2015-04-01 Thread James King
Thanks Saisai, Sure will do. But just a quick note that when I set the master as "local[*]" Spark started showing Kafka messages as expected, so the problem in my view was to do with not having enough threads to process the incoming data. Thanks. On Wed, Apr 1, 2015 at 10:53 AM, Saisai Shao wrote: > Wo
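A sketch of why the master setting matters here: a receiver-based stream pins one core per receiver, so a local master needs at least two threads (the ZooKeeper address, group and topic below are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // local[*] (or at least local[2]) leaves threads free for processing besides the receiver.
    val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaLogConsumer")
    val ssc = new StreamingContext(conf, Seconds(2))

    val input = KafkaUtils.createStream(ssc, "zkhost:2181", "log-consumer-group", Map("test" -> 1))
    input.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()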

Re: Spark + Kafka

2015-04-01 Thread Saisai Shao
Would you please share your code snippet, so we can identify if there is anything wrong in your code? Besides, would you please grep your driver's debug log to see if there's any debug log about "Stream xxx received block xxx"; this means that Spark Streaming keeps receiving data from the sourc

Re: Using 'fair' scheduler mode

2015-04-01 Thread Mark Hamstra
> > I am using the Spark ‘fair’ scheduler mode. What do you mean by this? Fair scheduling mode is not one thing in Spark, but allows for multiple configurations and usages. Presumably, at a minimum you are using SparkConf to set spark.scheduler.mode to "FAIR", but then how are you setting up s
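For reference, a minimal sketch of the pieces usually involved (the pool name and allocation file path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduling-demo")
      .set("spark.scheduler.mode", "FAIR")
      // Optional: pool definitions (weight, minShare) in an allocation file.
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go into the named pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")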

Re: Spark + Kafka

2015-04-01 Thread James King
Thank you bit1129, From looking at the web UI I can see 2 cores. Also looking at http://spark.apache.org/docs/1.2.1/configuration.html, but I can't see an obvious configuration for the number of receivers; can you help please? On Wed, Apr 1, 2015 at 9:39 AM, bit1...@163.com wrote: > Please make sure th

Re: Re: rdd.cache() not working ?

2015-04-01 Thread fightf...@163.com
Hi all Thanks a lot for capturing this. We are now using the 1.3.0 release. We tested with both the prebuilt version of spark and a source-code-compiled version targeting our CDH components, and the cache result did not show as expected. However, if we create a dataframe with the person rdd and use sqlCo
