Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Patanachai Tangchaisin
Currently, I use rdd.isEmpty() Thanks, Patanachai On 08/06/2015 12:02 PM, gpatcham wrote: Is there a way to filter out empty partitions before I write to HDFS other than using repartition and coalesce? -- View this message in context:
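The per-partition emptiness check under discussion can be illustrated without Spark. Below is a minimal sketch in plain Java, where a list of lists stands in for an RDD's partitions (`dropEmpty` and `partitions` are illustrative names, not Spark API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DropEmptyPartitions {
    // A list of per-partition record lists stands in for an RDD's partitions.
    static List<List<String>> dropEmpty(List<List<String>> partitions) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> partition : partitions) {
            // The same emptiness test rdd.isEmpty() performs on a whole RDD.
            if (!partition.isEmpty()) {
                kept.add(partition);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList(
                Arrays.asList("a", "b"),
                new ArrayList<String>(),   // an empty partition
                Arrays.asList("c"));
        System.out.println(dropEmpty(parts).size()); // 2
    }
}
```

In Spark itself there is no equally direct RDD operation, which is why the thread converges on `coalesce`/`repartition` or a post-write cleanup instead.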

Re: log4j.xml bundled in jar vs log4j.properties in spark/conf

2015-08-06 Thread mlemay
I'm having the same problem here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-xml-bundled-in-jar-vs-log4-properties-in-spark-conf-tp23923p24158.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
Well, I managed to solve that issue after running my tests on a Linux system instead of Windows (which I was originally using). However, now I have an error when I try to reset the Hive context using hc.reset(). It tries to create a file inside the directory /user/my_user_name instead of the usual

Spark Job Failed (Executor Lost then FS closed)

2015-08-06 Thread ๏̯͡๏
Code: import java.text.SimpleDateFormat import java.util.Calendar import java.sql.Date import org.apache.spark.storage.StorageLevel def extract(array: Array[String], index: Integer) = { if (index < array.length) { array(index).replaceAll("\"", "") } else { "" } } case class GuidSess(

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
Well, I tried this approach and still have issues. Apparently TestHive cannot delete the Hive metastore directory. The complete error that I have is: 15/08/06 15:01:29 ERROR Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Richard Marscher
Not that I'm aware of. We ran into a similar issue where we didn't want to keep accumulating all these empty part files in storage on S3 or HDFS. There didn't seem to be any way to do it with an RDD without a performance cost, so we just run a non-Spark post-batch operation to delete empty files from the

How can I know currently supported functions in Spark SQL

2015-08-06 Thread Netwaver
Hi All, I am using Spark 1.4.1, and I want to know how I can find the complete list of functions supported in Spark SQL. Currently I only know 'sum', 'count', 'min', 'max'. Thanks a lot.

how to stop twitter-spark streaming

2015-08-06 Thread Sadaf
Hi All, I am working with Spark Streaming and Twitter's user API. I used this code to stop streaming: ssc.addStreamingListener(new StreamingListener { var count = 1 override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) { count += 1

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-08-06 Thread Ted Yu
What is the JIRA number if a JIRA has been logged for this? Thanks On Jan 20, 2015, at 11:30 AM, Cheng Lian lian.cs@gmail.com wrote: Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open an

Re: spark job not accepting resources from worker

2015-08-06 Thread Kushal Chokhani
Any inputs? In the case of the following message, is there a way to check which resource is not sufficient through some logs? [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are

Re: Re: Real-time data visualization with Zeppelin

2015-08-06 Thread jun
Hi Andy, Is there any method to convert an IPython notebook file (.ipynb) to a Spark Notebook file (.snb) or vice versa? BR Jun At 2015-07-13 02:45:57, andy petrella andy.petre...@gmail.com wrote: Heya, You might be looking for something like this I guess:

Re: spark job not accepting resources from worker

2015-08-06 Thread Kushal Chokhani
Figured out the root cause. The master was randomly assigning a port to the worker for communication. Because of the firewall on the master, the worker couldn't send messages out to the master (maybe like resource details). Weird that the worker didn't even bother to throw any error. On 8/6/2015 3:24 PM, Kushal

Re: Re: Real-time data visualization with Zeppelin

2015-08-06 Thread andy petrella
Yep, most of the things will work just by renaming it :-D You can even use nbconvert afterwards On Thu, Aug 6, 2015 at 12:09 PM jun kit...@126.com wrote: Hi andy, Is there any method to convert ipython notebook file(.ipynb) to spark notebook file(.snb) or vice versa? BR Jun At

Re: How to read gzip data in Spark - Simple question

2015-08-06 Thread ๏̯͡๏
I got it running by myself On Wed, Aug 5, 2015 at 10:27 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Have you tried reading the spark documentation? http://spark.apache.org/docs/latest/programming-guide.html Thank you, Ilya Ganelin -Original Message- *From: *ÐΞ€ρ@Ҝ

Re: Unable to persist RDD to HDFS

2015-08-06 Thread Philip Weaver
This isn't really a Spark question. You're trying to parse a string to an integer, but it contains an invalid character. The exception message explains this. On Wed, Aug 5, 2015 at 11:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Code: import java.text.SimpleDateFormat import
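Philip's point can be made concrete with defensive parsing: return an empty value instead of letting the NumberFormatException propagate and kill the task. A minimal sketch in plain Java (`parseIntSafely` is an illustrative helper, not part of any library):

```java
import java.util.OptionalInt;

public class SafeParse {
    // Returns OptionalInt.empty() instead of throwing NumberFormatException
    // when the string contains an invalid character.
    static OptionalInt parseIntSafely(String s) {
        try {
            return OptionalInt.of(Integer.parseInt(s.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntSafely("42"));  // OptionalInt[42]
        System.out.println(parseIntSafely("4x2")); // OptionalInt.empty
    }
}
```

Inside a Spark map, the empty case can then be filtered out rather than failing the whole stage.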

Re: Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Pedro Rodriguez
Worth noting that Spark 1.5 is extending that list of Spark SQL functions quite a bit. Not sure where in the docs they would be yet, but the JIRA is here: https://issues.apache.org/jira/browse/SPARK-8159 On Thu, Aug 6, 2015 at 7:27 PM, Netwaver wanglong_...@163.com wrote: Thanks for your kindly

Re: Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Netwaver
Thanks for your kind help At 2015-08-06 19:28:10, Todd Nist tsind...@gmail.com wrote: They are covered here in the docs: http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wanglong_...@163.com wrote: Hi

Spark-submit fails when jar is in HDFS

2015-08-06 Thread abraithwaite
Hi All, We're trying to run spark with mesos and docker in client mode (since mesos doesn't support cluster mode) and load the application Jar from HDFS. The following is the command we're running: We're getting the following warning before an exception from that command: Before I debug

Re: Spark MLib v/s SparkR

2015-08-06 Thread praveen S
I am starting off with classification models: Logistic Regression and Random Forest. Basically I wanted to learn machine learning. Since I have a Java background I started off with MLlib, but later heard that R works as well (though with scaling issues). So, with SparkR, I was wondering whether the scaling issue would be resolved

Out of memory with twitter spark streaming

2015-08-06 Thread Pankaj Narang
Hi, I am running one application using Activator where I am retrieving tweets and storing them to a MySQL database using the code below. I get an OOM error after 5-6 hours with -Xmx1048M. If I increase the memory, the OOM is only delayed. Can anybody give me a clue? Here is the code: var tweetStream =

stopping spark stream app

2015-08-06 Thread Shushant Arora
Hi, I am using Spark Streaming 1.3 and using a custom checkpoint to save Kafka offsets. 1. Is doing Runtime.getRuntime().addShutdownHook(new Thread() { @Override public void run() { jssc.stop(true, true); System.out.println("Inside Add Shutdown Hook"); } }); to handle stop safe? 2. And I
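The hook registration in the question can be exercised without a Spark context. In the sketch below, `jssc.stop(true, true)` is replaced by a placeholder println since no streaming context is available here; `Runtime.removeShutdownHook` returns true only for a previously registered hook, which lets us verify registration without exiting the JVM:

```java
public class GracefulStop {
    public static void main(String[] args) {
        // jssc.stop(true, true) would go inside the hook; a println stands in.
        Thread hook = new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("Inside Add Shutdown Hook");
            }
        });
        Runtime.getRuntime().addShutdownHook(hook);

        // removeShutdownHook returns true only if the hook was registered,
        // which verifies registration without exiting the JVM.
        boolean wasRegistered = Runtime.getRuntime().removeShutdownHook(hook);
        System.out.println(wasRegistered); // true
    }
}
```

Note that shutdown hooks run concurrently and only on orderly JVM shutdown, so any blocking stop logic inside them should be bounded.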

spark job not accepting resources from worker

2015-08-06 Thread Kushal Chokhani
Hi, I have a Spark/Cassandra setup where I am using the Spark Cassandra Java connector to query a table. So far, I have 1 Spark master node (2 cores) and 1 worker node (4 cores). Both of them have the following spark-env.sh under conf/: #!/usr/bin/env bash export SPARK_LOCAL_IP=127.0.0.1

Re: Multiple UpdateStateByKey Functions in the same job?

2015-08-06 Thread Akhil Das
I think you can. Give it a try and see. Thanks Best Regards On Tue, Aug 4, 2015 at 7:02 AM, swetha swethakasire...@gmail.com wrote: Hi, Can I use multiple UpdateStateByKey Functions in the Streaming job? Suppose I need to maintain the state of the user session in the form of a Json and

Re: Extremely poor predictive performance with RF in mllib

2015-08-06 Thread Yanbo Liang
I can reproduce this issue, so it looks like a bug in Random Forest; I will try to find some clue. 2015-08-05 1:34 GMT+08:00 Patrick Lam pkph...@gmail.com: Yes, I rechecked and the label is correct. As you can see in the code posted, I used the exact same features for all the classifiers so

Unable to persist RDD to HDFS

2015-08-06 Thread ๏̯͡๏
Code: import java.text.SimpleDateFormat import java.util.Calendar import java.sql.Date import org.apache.spark.storage.StorageLevel def formatStringAsDate(dateStr: String) = new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime())
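The `formatStringAsDate` helper above translates directly to plain Java. The pattern "yyyy-MM-dd" is an assumption (the archive stripped the year part of the original pattern); the checked ParseException is wrapped so callers stay simple:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class FormatDate {
    // Mirrors the Scala helper above; "yyyy-MM-dd" is assumed, since the
    // archive stripped the year letters from the original pattern.
    static java.sql.Date formatStringAsDate(String dateStr) {
        try {
            return new java.sql.Date(
                    new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + dateStr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formatStringAsDate("2015-08-06")); // 2015-08-06
    }
}
```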

Re: NoSuchMethodError : org.apache.spark.streaming.scheduler.StreamingListenerBus.start()V

2015-08-06 Thread Akhil Das
For some reason you have two different versions of the Spark jars in your classpath. Thanks Best Regards On Tue, Aug 4, 2015 at 12:37 PM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, I am trying to read data from Kafka and process it using Spark. I have attached my source

Talk on Deep dive in Spark Dataframe API

2015-08-06 Thread madhu phatak
Hi, Recently I gave a talk on a deep dive into the DataFrame API and SQL Catalyst. Video of the same is available on YouTube with slides and code. Please have a look if you are interested. http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Re: Twitter Connector-Spark Streaming

2015-08-06 Thread Akhil Das
Hi Sadaf, Which version of Spark are you using? And what's in the spark-env.sh file? I think you are using both SPARK_CLASSPATH (which is deprecated) and spark.executor.extraClassPath (maybe set in the spark-defaults.conf file). Thanks Best Regards On Wed, Aug 5, 2015 at 6:22 PM, Sadaf Khan

Re: Memory allocation error with Spark 1.5

2015-08-06 Thread Alexis Seigneurin
Works like a charm. Thanks Reynold for the quick and efficient response! Alexis 2015-08-05 19:19 GMT+02:00 Reynold Xin r...@databricks.com: In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread Jörn Franke
Yes, you should use ORC; it is much faster and more compact. Additionally you can apply compression (Snappy) to increase performance. Your data processing pipeline seems to be not very optimized. You should use the newest Hive version, enabling storage indexes and bloom filters on appropriate

Spark-submit not finding main class and the error reflects different path to jar file than specified

2015-08-06 Thread Stephen Boesch
Given the following command line to spark-submit: bin/spark-submit --verbose --master local[2] --class org.yardstick.spark.SparkCoreRDDBenchmark /shared/ysgood/target/yardstick-spark-uber-0.0.1.jar Here is the output: NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Todd Nist
Well, the creation of a Thrift server would be to allow external access to the data from JDBC/ODBC type connections. The sparkstreaming-sql project leverages a standard Spark SQL context and then provides a means of converting an incoming DStream into a row; look at the MessageToRow trait in KafkaSource

Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread venkatesh b
Hi, here I have two things to ask. FIRST: In our project we use Hive. We get new data daily. We need to process this new data only once, and send the processed data to an RDBMS. In processing we mostly use complex queries with joins, where conditions, and grouping functions. There are

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread Jörn Franke
Additionally it is of key importance to use the right data types for the columns. Use int for ids, and int, decimal, float or double etc. for numeric values. A bad data model using varchars and strings where not appropriate is a significant bottleneck. Furthermore, include partition columns

Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Todd Nist
They are covered here in the docs: http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wanglong_...@163.com wrote: Hi All, I am using Spark 1.4.1, and I want to know how can I find the complete function

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Daniel Haviv
Thank you Todd, How is the sparkstreaming-sql project different from starting a Thrift server on a streaming app? Thanks again. Daniel On Thu, Aug 6, 2015 at 1:53 AM, Todd Nist tsind...@gmail.com wrote: Hi Daniel, It is possible to create an instance of the SparkSQL Thrift server,

Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Ted Yu
Have you looked at this? http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$ On Aug 6, 2015, at 2:52 AM, Netwaver wanglong_...@163.com wrote: Hi All, I am using Spark 1.4.1, and I want to know how can I find the complete function list

Re: Combining Spark Files with saveAsTextFile

2015-08-06 Thread MEETHU MATHEW
Hi, Try using coalesce(1) before calling saveAsTextFile(). Thanks Regards, Meethu M On Wednesday, 5 August 2015 7:53 AM, Brandon White bwwintheho...@gmail.com wrote: What is the best way to make saveAsTextFile save as only a single file?
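When re-running the job with coalesce(1) is not an option, the same end result (one file) can be approximated after the fact by concatenating the part files. A sketch in plain Java, assuming line-oriented text output; `merge` is an illustrative helper, not a Spark or Hadoop API, and it ignores markers like _SUCCESS:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergePartFiles {
    // Concatenates the part-* files in a directory, in lexicographic order,
    // into a single output file.
    static void merge(Path dir, Path out) {
        try (Stream<Path> files = Files.list(dir)) {
            List<Path> parts = files
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
            List<String> lines = new ArrayList<>();
            for (Path part : parts) {
                lines.addAll(Files.readAllLines(part));
            }
            Files.write(out, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The lexicographic sort preserves the part-00000, part-00001, ... ordering that Spark writes.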

Re: Spark Kinesis Checkpointing/Processing Delay

2015-08-06 Thread Patanachai Tangchaisin
Hi, I actually ran into the same problem, although our endpoint is not ElasticSearch. When the Spark job is dead, we lose some data because the Kinesis checkpoint is already beyond the last point that Spark has processed. Currently, our workaround is to use Spark's checkpoint mechanism with write

Re: SparkR -Graphx

2015-08-06 Thread Shivaram Venkataraman
+Xiangrui I am not sure exposing the entire GraphX API would make sense, as it contains a lot of low-level functions. However we could expose some high-level functions like PageRank etc. Xiangrui, who has been working on similar techniques to expose MLlib functions like GLM, might have more to add.

Re: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-06 Thread Steve Loughran
There's no support for IAM roles in the s3n:// client code in Apache Hadoop (HADOOP-9384); Amazon's modified EMR distro may have it. The s3a filesystem adds it; this is ready for production use in Hadoop 2.7.1+ (implicitly HDP 2.3; CDH 5.4 has cherry-picked the relevant patches). I don't

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Philip Weaver
With DEBUG, the log output was over 10MB, so I opted for just INFO output. The (sanitized) log is attached. The driver is essentially this code: info(A) val t = System.currentTimeMillis val df = sqlContext.read.parquet(dir).select(...).cache val elapsed =

Spark Kinesis Checkpointing/Processing Delay

2015-08-06 Thread phibit
Hi! I'm using Spark + Kinesis ASL to process and persist stream data to ElasticSearch. For the most part it works nicely. There is a subtle issue I'm running into about how failures are handled. For example's sake, let's say I am processing a Kinesis stream that produces 400 records per second,

SparkR -Graphx

2015-08-06 Thread smagadi
Wanted to use GraphX from SparkR; is there a way to do it? I think as of now it is not possible. I was thinking one could write a wrapper in R that can call the Scala GraphX libraries. Any thoughts on this, please. -- View this message in context:

Re: Is it worth storing in ORC for one time read. And can be replace hive with HBase

2015-08-06 Thread venkatesh b
I'm really sorry, by mistake I posted to the Spark mailing list. Jörn Franke, thanks for your reply. I have many joins, many complex queries, and all are table scans. So I think HBase will not work for me. On Thursday, August 6, 2015, Jörn Franke jornfra...@gmail.com wrote: Additionally it is of key

Re: subscribe

2015-08-06 Thread Akhil Das
Welcome aboard! Thanks Best Regards On Thu, Aug 6, 2015 at 11:21 AM, Franc Carter franc.car...@rozettatech.com wrote: subscribe

Re: subscribe

2015-08-06 Thread Ted Yu
See http://spark.apache.org/community.html Cheers On Aug 5, 2015, at 10:51 PM, Franc Carter franc.car...@rozettatech.com wrote: subscribe

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Sebastian Kalix
Thanks a lot Igor, the following hashCode function is stable: @Override public int hashCode() { int hash = 5; hash = 41 * hash + this.myEnum.ordinal(); return hash; } For anyone having the same problem:
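The same idea generalizes to any POJO key that contains an enum: hash the ordinal (stable across JVMs) rather than the enum's identity hashCode (which differs per JVM, so equal keys can hash to different partitions on different executors). A self-contained sketch, where `UserEvent` and `Category` are illustrative names, not types from the thread:

```java
import java.util.Objects;

public class UserEvent {
    enum Category { ALPHA, BETA }

    final Category category;
    final String name;

    UserEvent(Category category, String name) {
        this.category = category;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof UserEvent)) return false;
        UserEvent other = (UserEvent) o;
        return category == other.category && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        // Hash the ordinal, not the enum itself: ordinals are identical in
        // every JVM, so equal keys hash the same way on every executor.
        return Objects.hash(category.ordinal(), name);
    }
}
```

Keeping equals and hashCode consistent like this is what makes the POJO safe to use as a key in RDD operations such as groupByKey or distinct.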

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-06 Thread Dimitris Kouzis - Loukas
Hi - yes - it's great that you wrote it yourself - it means you have more control. I have the feeling that the most efficient point to discard as much data as possible is your Spark input source - or you could even modify your subscription protocol to not even receive the other 50 seconds of data

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-06 Thread Dimitris Kouzis - Loukas
Re-reading your description - I guess you could potentially make your input source connect for 10 seconds, pause for 50, and then reconnect. On Thu, Aug 6, 2015 at 10:32 AM, Dimitris Kouzis - Loukas look...@gmail.com wrote: Hi - yes - it's great that you wrote it yourself - it means you

Aggregate by timestamp from json message

2015-08-06 Thread vchandra
Hi team, I am very new to Spark; actually today is my first day. I have a nested JSON string which contains a timestamp and lots of other details. I have JSON messages from which I need to write multiple aggregations, but for now I need to write one aggregation. If code

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-06 Thread Akhil Das
You just pasted your Twitter credentials, consider changing them. :/ Thanks Best Regards On Wed, Aug 5, 2015 at 10:07 PM, narendra narencs...@gmail.com wrote: Thanks Akash for the answer. I added the endpoint to the listener and now it is working. -- View this message in context:

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Philip Weaver
I built spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again. The initialization time is about 1 minute now, which is still pretty terrible. On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver philip.wea...@gmail.com wrote: Absolutely, thanks! On Wed, Aug 5, 2015 at 9:07 PM,

Multiple Thrift servers on one Spark cluster

2015-08-06 Thread Bojan Kostic
Hi, Is there a way to instantiate multiple Thrift servers on one Spark Cluster? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Thrift-servers-on-one-Spark-cluster-tp24148.html Sent from the Apache Spark User List mailing list archive at

Enum values in custom objects mess up RDD operations

2015-08-06 Thread Warfish
Hi everyone, I have been working with Spark for a little while now and have encountered a very strange behaviour that caused me a lot of headaches: I have written my own POJOs to encapsulate my data, and this data is held in some JavaRDDs. Part of these POJOs is a member variable of a custom enum type.

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Igor Berman
An enum's hashCode is JVM-instance specific (i.e. different JVMs will give you different values), so you can use the ordinal in the hashCode computation, or use the hashCode of the enum's ordinal as part of the hashCode computation. On 6 August 2015 at 11:41, Warfish sebastian.ka...@gmail.com wrote: Hi everyone, I was

Re: spark hangs at broadcasting during a filter

2015-08-06 Thread Alex Gittens
Thanks. Repartitioning to a smaller number of partitions seems to fix my issue, but I'll keep broadcasting in mind (droprows is an integer array with about 4 million entries). On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver philip.wea...@gmail.com wrote: How big is droprows? Try explicitly

Temp file missing when training logistic regression

2015-08-06 Thread Cat
Hello, I am using the Python API to perform a grid search and train models using LogisticRegressionWithSGD. I am using r3.xl machines in EC2, running on top of YARN in cluster mode. The training RDD is persisted in memory and on disk. Some of the models train successfully, but then at some

How to binarize data in spark

2015-08-06 Thread Adamantios Corais
I have a set of data based on which I want to create a classification model. Each row has the following form: user1,class1,product1 user1,class1,product2 user1,class1,product5 user2,class1,product2 user2,class1,product5 user3,class2,product1 etc There are about 1M users, 2 classes, and 1M
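Before reaching for MLlib, the binarization itself is simple: map each user to a 0/1 vector indexed by product. A Spark-free sketch in plain Java, following the user,class,product row layout from the question (`binarize` is an illustrative helper; a real 1M-user dataset would use sparse vectors instead of dense arrays):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Binarize {
    // Maps each user to a binary vector; the products list fixes column order.
    static Map<String, int[]> binarize(List<String[]> rows, List<String> products) {
        Map<String, int[]> vectors = new LinkedHashMap<>();
        for (String[] row : rows) {
            // row layout: user, class, product
            int[] vec = vectors.computeIfAbsent(row[0], k -> new int[products.size()]);
            vec[products.indexOf(row[2])] = 1;
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"user1", "class1", "product1"},
                new String[]{"user1", "class1", "product2"},
                new String[]{"user2", "class1", "product2"});
        List<String> products = Arrays.asList("product1", "product2");
        System.out.println(Arrays.toString(binarize(rows, products).get("user1"))); // [1, 1]
    }
}
```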

Terminate streaming app on cluster restart

2015-08-06 Thread Alexander Krasheninnikov
Hello, everyone! I have a case when running a standalone cluster: when stop-all.sh/start-all.sh are invoked on the master, the streaming app loses all its executors but does not terminate. Since it is a streaming app, expected to deliver its results ASAP, downtime is undesirable. Is there any workaround

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Cheng Lian
Would you mind to provide the driver log? On 8/6/15 3:58 PM, Philip Weaver wrote: I built spark from the v1.5.0-snapshot-20150803 tag in the repo and tried again. The initialization time is about 1 minute now, which is still pretty terrible. On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver