Re: stopping spark stream app

2015-08-09 Thread Shushant Arora
Hi, how do I ensure in Spark Streaming 1.3 with Kafka that when an application is killed, the last running batch is fully processed and the offsets are written to the checkpoint dir? On Fri, Aug 7, 2015 at 8:56 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi I am using spark stream 1.3 and using
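[Editor's note: a minimal sketch of a graceful shutdown on Spark Streaming 1.3; the checkpoint path and the shutdown-hook setup are assumptions for illustration, not details from the thread. stopGracefully = true lets the in-flight batch finish before the context exits.]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GracefulKafkaApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-stream")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("hdfs:///tmp/checkpoints")   // hypothetical checkpoint dir

        // ... build the direct Kafka stream and transformations here ...

        sys.addShutdownHook {
          // finish the running batch and checkpoint before stopping the SparkContext
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }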

deleting application files in standalone cluster

2015-08-09 Thread Lior Chaga
Hi, using Spark 1.4.0 in standalone mode with the following configuration: SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400. The cleanup interval is set to the default, yet application files are not deleted. Using JavaSparkContext, and when the application ends it

ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-09 Thread Sadaf
Hi, when I tried to stop Spark Streaming using ssc.stop(false, true), it gave the following error: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver 15/08/07 13:41:11 WARN ReceiverSupervisorImpl: Stopped executor without error 15/08/07 13:41:20 WARN WriteAheadLogManager :

Re: Spark-submit not finding main class and the error reflects different path to jar file than specified

2015-08-09 Thread Akhil Das
Are you setting SPARK_PREPEND_CLASSES? Try disabling it. Here your uber jar, which does not have the SparkConf, is put first on the classpath, which is messing it up. Thanks Best Regards On Thu, Aug 6, 2015 at 5:48 PM, Stephen Boesch java...@gmail.com wrote: Given the following

Re: Multiple Thrift servers on one Spark cluster

2015-08-09 Thread Akhil Das
Did you try this way? export HIVE_SERVER2_THRIFT_PORT=6066 ./sbin/start-thriftserver.sh --master master-uri export HIVE_SERVER2_THRIFT_PORT=6067 ./sbin/start-thriftserver.sh --master master-uri I think you just have to change HIVE_SERVER2_THRIFT_PORT to instantiate multiple servers. Thanks

Re: SparkException: Yarn application has already ended

2015-08-09 Thread Akhil Das
Just make sure your Hadoop instances are functioning properly (check for ResourceManager, NodeManager). How are you submitting the job? If it is getting submitted, then you can look further in the YARN logs to see what's really going on. Thanks Best Regards On Thu, Aug 6, 2015 at 6:59 PM, Clint

Re: Temp file missing when training logistic regression

2015-08-09 Thread Akhil Das
Which version of Spark are you using? It looks like you are hitting the file handle limit; in that case you might want to increase the ulimit. You can validate this by looking in the worker logs (which would probably show a Too many open files exception). Thanks Best Regards On Thu, Aug 6, 2015 at

Re: Spark Job Failed (Executor Lost then FS closed)

2015-08-09 Thread Akhil Das
Can you look further in the worker logs and see what's going on? It looks like a memory issue (GC overhead or the like; you need to look in the worker logs). Thanks Best Regards On Fri, Aug 7, 2015 at 3:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Re attaching the images. On Thu, Aug 6, 2015

Merge metadata error when appending to parquet table

2015-08-09 Thread Krzysztof Zarzycki
Hi there, I have a problem with a Spark Streaming job running on Spark 1.4.1 that appends to a Parquet table. My job receives JSON strings and creates a JSON RDD out of them. The JSONs might come in different shapes, as most of the fields are optional, but they never have conflicting schemas. Next, for
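[Editor's note: for context, a minimal sketch of the append-to-Parquet pattern on Spark 1.4; the paths and field names are made up, not taken from the thread.]

    import org.apache.spark.sql.{SQLContext, SaveMode}

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)

    // per micro-batch: parse that batch's JSON strings into a DataFrame
    val batchDF = sqlContext.read.json(sc.parallelize(Seq(
      """{"id": 1, "name": "a"}""",
      """{"id": 2, "extra": "only sometimes present"}""")))

    // append to the Parquet table; files may carry different subsets of the optional fields
    batchDF.write.mode(SaveMode.Append).parquet("hdfs:///tables/events")

    // reading the table back merges the per-file schemas
    // (Parquet schema merging is enabled by default in Spark 1.4)
    val merged = sqlContext.read.parquet("hdfs:///tables/events")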

Starting a service with Spark Executors

2015-08-09 Thread Daniel Haviv
Hi, I'd like to start a service with each Spark executor upon initialization and have the distributed code reference that service locally. What I'm trying to do is to invoke torch7 computations without reloading the model for each row, by starting a Lua HTTP handler that will receive HTTP requests for

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Akhil Das
Did you try this way? /usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf spark.mesos.executor.docker.image=docker.repo/spark:latest --class org.apache.spark.examples.SparkPi --jars hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100 Thanks Best

Re: Out of memory with twitter spark streaming

2015-08-09 Thread Akhil Das
I'm not sure what you are up to, but if you can explain what you are trying to achieve then maybe we can restructure your code. At a quick glance I could see: tweetsRDD.collect().map(tweet => DBQuery.saveTweets(tweet)), which will bring the whole data set onto your driver machine, and it would
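[Editor's note: a minimal sketch of the usual alternative to collect() for writes. DBQuery is the name from the thread, but openConnection() and the two-argument saveTweets variant are hypothetical placeholders for whatever persistence code is actually used.]

    // instead of pulling every tweet to the driver:
    //   tweetsRDD.collect().map(tweet => DBQuery.saveTweets(tweet))
    // write from the executors, one connection per partition:
    tweetsRDD.foreachPartition { tweets =>
      val connection = DBQuery.openConnection()   // hypothetical helper, one connection per partition
      try {
        tweets.foreach(tweet => DBQuery.saveTweets(connection, tweet))
      } finally {
        connection.close()
      }
    }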

Re: Accessing S3 files with s3n://

2015-08-09 Thread Akhil Das
It depends on which operation you are doing. If you are doing a .count() on a Parquet file, it might not download the entire file, I think, but if you do a .count() on a normal text file it might pull the entire file. Thanks Best Regards On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya aara...@gmail.com

Re: Is using Spark or Pig group by efficient in my use case?

2015-08-09 Thread Akhil Das
Why not give it a shot? Spark generally outruns old MapReduce jobs. Thanks Best Regards On Sat, Aug 8, 2015 at 8:25 AM, linlma lin...@gmail.com wrote: I have tens of millions of records, which are customer ID and city ID pairs. There are tens of millions of unique customer IDs, and only a few

Re: How to run start-thrift-server in debug mode?

2015-08-09 Thread Akhil Das
It seems it is not able to pick up the debug parameters. You can set export _JAVA_OPTIONS=-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=y and then submit the job to enable debugging. Thanks Best Regards On Fri, Aug 7, 2015 at 10:20 PM, Benjamin Ross

Questions about SparkSQL join on non-equality conditions

2015-08-09 Thread gen tang
Hi, I might have a stupid question about SparkSQL's implementation of join on non-equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such a join, as it is very difficult to express such conditions as a map/reduce job. However, SparkSQL supports such
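[Editor's note: a small sketch of the kind of join being discussed, using the DataFrame API on made-up tables and columns; it is not code from the thread.]

    import org.apache.spark.sql.SQLContext

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val users    = sc.parallelize(Seq((1, "a@x.com"), (2, "b@y.com"))).toDF("userId", "email")
    val accounts = sc.parallelize(Seq((1, "c@z.com"), (3, "b@y.com"))).toDF("userId", "email")

    // "condition1 or condition2" cannot be reduced to a single equi-join key,
    // so Spark SQL plans it as a nested-loop / cartesian-style join, which is
    // why such joins are supported but can be much more expensive than equi-joins
    val joined = users.join(
      accounts,
      users("userId") === accounts("userId") || users("email") === accounts("email"))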

Re: How to create DataFrame from a binary file?

2015-08-09 Thread bo yang
You can create your own data schema (StructType in Spark) and use the following method to create a data frame with it: sqlContext.createDataFrame(yourRDD, structType). I wrote a post on how to do it; you can also get the sample code there: Light-Weight Self-Service Data Query
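[Editor's note: a minimal, self-contained sketch of the pattern bo describes; the field names and sample rows are made up.]

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)

    // define the schema by hand
    val schema = StructType(Seq(
      StructField("id",   LongType,   nullable = false),
      StructField("name", StringType, nullable = true)))

    // build an RDD[Row] whose rows match the schema
    val rowRDD = sc.parallelize(Seq(Row(1L, "alice"), Row(2L, "bob")))

    // create the DataFrame with the custom schema
    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.printSchema()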

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Dean Wampler
Also, Spark on Mesos supports cluster mode: http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

streaming application: how many times is the map transformation constructor called?

2015-08-09 Thread Shushant Arora
In a streaming application, how many times is the map transformation object created? Say I have directKafkaStream.repartition(numPartitions).mapPartitions(new FlatMapFunction_derivedclass(configs)); class FlatMapFunction_derivedclass { FlatMapFunction_derivedclass(Config config) { } @Override
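[Editor's note: a Scala sketch of the same pattern, with hypothetical class and config names. The function object is constructed once on the driver when the DStream pipeline is defined; executors then receive serialized copies of it with the tasks of each batch rather than re-running the constructor.]

    import org.apache.spark.streaming.dstream.DStream

    // hypothetical Scala analogue of the FlatMapFunction_derivedclass pattern
    class PartitionMapper(config: Map[String, String]) extends Serializable {
      // this constructor runs on the driver, when the pipeline below is defined
      def process(records: Iterator[String]): Iterator[String] =
        records.map(r => r + config.getOrElse("suffix", ""))
    }

    def buildPipeline(stream: DStream[String], numPartitions: Int): DStream[String] = {
      val mapper = new PartitionMapper(Map("suffix" -> "-processed"))
      // `mapper` is serialized with the tasks of each batch; executors work on
      // deserialized copies of it
      stream.repartition(numPartitions).mapPartitions(mapper.process)
    }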

Re: intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocolBufferException

2015-08-09 Thread Ted Yu
Can you check if there is a protobuf version other than 2.5.0 on the classpath? Please show the complete stack trace. Cheers On Sun, Aug 9, 2015 at 9:41 AM, longda...@163.com longda...@163.com wrote: hi all, i compile spark-1.3.1 on linux use intellij14 and got error assertion failed:

Re: intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocolBufferException

2015-08-09 Thread longda...@163.com
the stack trace is below Error:scalac: while compiling: /home/xiaoju/data/spark-1.3.1/core/src/main/scala/org/apache/spark/SparkContext.scala during phase: typer library version: version 2.10.4 compiler version: version 2.10.4 reconstructed args: -nobootcp

Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat, Is there a particular reason you don't use s3a? From my experience, s3a performs much better than the rest. I believe the inefficiency is from the implementation of the s3 interface. Best Regards, Jerry Sent from my iPhone On 9 Aug, 2015, at 5:48 am, Akhil Das

Re: Merge metadata error when appending to parquet table

2015-08-09 Thread Krzysztof Zarzycki
Besides finding a fix for this problem, I think I can work around at least the WARNING message by overriding the Parquet variable parquet.enable.summary-metadata, which according to the PARQUET-107 ticket (https://issues.apache.org/jira/browse/PARQUET-107) can be used to disable writing the summary file, which is

intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocolBufferException

2015-08-09 Thread longda...@163.com
Hi all, I compiled spark-1.3.1 on Linux using IntelliJ 14 and got the error assertion failed: com.google.protobuf.InvalidProtocolBufferException. How can I solve this problem?

Re: How to create DataFrame from a binary file?

2015-08-09 Thread Umesh Kacha
Hi Bo, I know how to create a DataFrame; my question is how to create a DataFrame from binary files, whereas in your blog it is raw text JSON files. Please read my question properly, thanks. On Sun, Aug 9, 2015 at 11:21 PM, bo yang bobyan...@gmail.com wrote: You can create your own data schema
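[Editor's note: a minimal sketch of one way to do this, not an answer from the thread: read the files with sc.binaryFiles and put the raw bytes into a BinaryType column of a hand-built schema. The input path is made up.]

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types._

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)

    // one record per file: (path, contents as bytes)
    val binaries = sc.binaryFiles("hdfs:///data/blobs")   // hypothetical path
      .map { case (path, stream) => Row(path, stream.toArray) }

    val schema = StructType(Seq(
      StructField("path",    StringType, nullable = false),
      StructField("content", BinaryType, nullable = false)))

    val df = sqlContext.createDataFrame(binaries, schema)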

Re: Re: intellij14 compiling spark-1.3.1 got error: assertion failed: com.google.protobuf.InvalidProtocolBufferException

2015-08-09 Thread 龙淡
Thank you for the reply. I use sbt to compile Spark, but there are both protobuf 2.4.1 and 2.5.0 in the Maven repository, and protobuf 2.5.0 in the .ivy repository. The stack trace is below: Error:scalac: while compiling:

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Alan Braithwaite
Did you try this way? /usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf spark.mesos.executor.docker.image=docker.repo/spark:latest --class org.apache.spark.examples.SparkPi --jars hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100 I did, and got

Re: How to create DataFrame from a binary file?

2015-08-09 Thread bo yang
Well, my post uses a raw text JSON file to show how to create a data frame with a custom data schema. The key idea is to show the flexibility to deal with any format of data by using your own schema. Sorry if I did not make that fully clear. Anyway, let us know once you figure out your problem.

Re: Merge metadata error when appending to parquet table

2015-08-09 Thread Cheng Lian
The conflicting metadata values warning is a known issue (https://issues.apache.org/jira/browse/PARQUET-194). The option parquet.enable.summary-metadata is a Hadoop option rather than a Spark option, so you need to either add it to your Hadoop configuration file(s) or add it via
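[Editor's note: a minimal sketch of setting the option programmatically, assuming you go through the SparkContext's Hadoop configuration; the same key can alternatively live in the Hadoop configuration files.]

    // assumes an existing SparkContext `sc`; this is a Hadoop option, not a Spark one,
    // so it goes into the Hadoop configuration rather than SparkConf
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")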

Re: Spark job workflow engine recommendations

2015-08-09 Thread Lars Albertsson
I used to maintain Luigi at Spotify, and got some insight into workflow manager characteristics and production behaviour in the process. I am evaluating options for my current employer, and the short list is basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The latter is not

multiple dependency jars using pyspark

2015-08-09 Thread Jonathan Haddad
I'm trying to write a simple job for PySpark 1.4 migrating data from MySQL to Cassandra. I can work with either the MySQL JDBC jar or the Cassandra jar separately without issue, but when I try to reference both of them it throws an exception: Py4JJavaError: An error occurred while calling

Error when running pyspark/shell.py to set up iPython notebook

2015-08-09 Thread YaoPau
I'm trying to set up iPython notebook on an edge node with port forwarding so I can run pyspark off my laptop's browser. I've mostly been following the Cloudera guide here: http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ I got this working on one cluster

mllib kmeans produces 1 large and many extremely small clusters

2015-08-09 Thread farhan
I tried running MLlib k-means with the 20 newsgroups data set from sklearn. On a 5000-document data set I get one cluster with most of the documents, while the other clusters have just a handful of documents each. #code newsgroups_train = fetch_20newsgroups(subset='train',random_state=1,remove=('headers',

Re: Starting a service with Spark Executors

2015-08-09 Thread Koert Kuipers
Starting is easy: just use a lazy val. Stopping is harder; I do not think executors have a cleanup hook currently... On Sun, Aug 9, 2015 at 5:29 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'd like to start a service with each Spark Executor upon initialization and have the
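[Editor's note: a minimal sketch of the lazy-val approach Koert describes; the object name is made up and a plain ServerSocket stands in for the torch7/Lua HTTP handler.]

    object ExecutorSideService {
      // initialized at most once per executor JVM, on first access
      lazy val server: java.net.ServerSocket = {
        val s = new java.net.ServerSocket(0)   // stand-in for the real HTTP handler
        // start whatever background service is needed here
        s
      }
    }

    // touching the lazy val from inside a task starts the service once per executor
    sc.parallelize(1 to 100).mapPartitions { nums =>
      val port = ExecutorSideService.server.getLocalPort
      nums.map(n => s"record $n handled via local service on port $port")
    }.count()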

Re: Accessing S3 files with s3n://

2015-08-09 Thread bo yang
Hi Akshat, I found an open source library which implements an S3 InputFormat for Hadoop, and I use Spark's newAPIHadoopRDD to load data via that S3 InputFormat. The open source library is https://github.com/ATLANTBH/emr-s3-io. It is a little old, so I looked inside it and made some changes. Then it
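[Editor's note: the general shape of loading through a custom InputFormat looks roughly like the sketch below. The bucket path is made up, and the standard TextInputFormat is used as a placeholder; the actual class names from emr-s3-io are not reproduced here.]

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}

    // assumes an existing SparkContext `sc`
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "s3n://my-bucket/my-prefix/")

    val rdd = sc.newAPIHadoopRDD(
      hadoopConf,
      classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],  // swap in the S3 InputFormat here
      classOf[LongWritable],
      classOf[Text])

    rdd.map(_._2.toString).take(5).foreach(println)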