Re: collect failed for unknown reason when deployed in standalone mode

2014-08-11 Thread jeanlyn92
Hi wangyi: do you have more detailed information? I guess it may be caused by needing jars that haven't been uploaded to the workers, such as the one containing your main class. ./bin/spark-class org.apache.spark.deploy.Client launch [client-options] \ cluster-url application-jar-url main-class \ [application-options]

Spark RuntimeException due to Unsupported datatype NullType

2014-08-11 Thread rafeeq s
Hi, Spark throws a RuntimeException due to "Unsupported datatype NullType" when saving a jsonRDD containing null primitives with saveAsParquetFile(). Code: I am trying to store the jsonRDD into a Parquet file using saveAsParquetFile with the code below. JavaRDD<String> javaRDD =
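A hedged sketch of the failure mode, assuming a Spark build where SQLContext.jsonRDD is available and using illustrative field names: when a JSON field only ever carries null, schema inference assigns it NullType, which Parquet cannot represent; one crude workaround is to default such fields before inference.

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val json = sc.parallelize(Seq("""{"id": 1, "note": null}""", """{"id": 2, "note": null}"""))
  sqlContext.jsonRDD(json).printSchema()   // "note" is inferred as NullType, which Parquet rejects
  // crude workaround: default the all-null field before inference (field name is illustrative)
  val cleaned = sqlContext.jsonRDD(json.map(_.replace("\"note\": null", "\"note\": \"\"")))
  cleaned.saveAsParquetFile("hdfs:///tmp/people.parquet")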

Re: error with pyspark

2014-08-11 Thread Ron Gonzalez
If you're running on Ubuntu, run ulimit -n, which gives the max number of allowed open files. You will have to change the value in /etc/security/limits.conf to something like 1, log out and log back in. Thanks, Ron Sent from my iPad On Aug 10, 2014, at 10:19 PM, Davies Liu
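For reference, a hedged sketch of the limits.conf entries this refers to (the numeric limit is only an example, not a recommendation):

  # /etc/security/limits.conf -- raise the open-files limit for all users (value is illustrative)
  *    soft    nofile    10000
  *    hard    nofile    10000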

spark sql (can it call impala udf)

2014-08-11 Thread marspoc
I want to run the query below, which I run in Impala calling a C++ UDF, in Spark SQL. pnl_flat_pp and pfh_flat are both partitioned Impala tables. Can Spark SQL do that? select a.pnl_type_code,percentile_udf_cloudera(cast(90.0 as

[spark-streaming] kafka source and flow control

2014-08-11 Thread gpasquiers
Hi, I’m new to this mailing list as well as to spark-streaming. I’m using spark-streaming in a Cloudera environment to consume a Kafka source and store all the data into HDFS. There is a great volume of data, and our issue is that the Kafka consumer is going too fast for HDFS: it fills up the storage

Re: CDH5, HiveContext, Parquet

2014-08-11 Thread chutium
hive-thriftserver also does not work with Parquet tables in the Hive metastore; will this PR fix that too? And is there no need to change any pom.xml?

Re: Low Performance of Shark over Spark.

2014-08-11 Thread vinay . kashyap
Hi Yana, I notice there is GC happening in every executor, around 400ms on average. Do you think it has a major impact on the overall query time? And regarding the memory for a single worker, I have tried distributing the memory by increasing the number of workers per node and

Re: How to direct insert vaules into SparkSQL tables?

2014-08-11 Thread chutium
No, Spark SQL cannot insert into or update text files yet; it can only insert into Parquet files. But people.union(new_people).registerAsTable("people") could be an idea.
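A hedged sketch of that union-then-re-register idea, using hypothetical case-class and table names and the Spark 1.0-era implicit RDD-to-SchemaRDD conversion:

  case class Person(name: String, age: Int)
  import sqlContext.createSchemaRDD   // implicit conversion from RDD[Person] to SchemaRDD

  val people = sc.parallelize(Seq(Person("alice", 30)))
  val new_people = sc.parallelize(Seq(Person("bob", 25)))
  // re-register the union under the same name so later SQL sees both batches
  people.union(new_people).registerAsTable("people")
  sqlContext.sql("SELECT COUNT(*) FROM people").collect()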

RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
Hi, We intend to apply other operations on the data later in the same Spark context, but our first step is to archive it. Our goal is something like this: Step 1: consume Kafka. Step 2: archive to HDFS AND send to step 3. Step 3: transform data. Step 4: save transformed data to HDFS as input for

how to split RDD by key and save to different path

2014-08-11 Thread 诺铁
hi, I have googled and found a similar question without a good answer: http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark In short, I would like to separate raw data by some key, for example creation date, and put them in directories
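One commonly cited approach is a custom MultipleTextOutputFormat that routes each record to a sub-directory named after its key; a hedged sketch, with illustrative paths and keys:

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
    // write each key's records under a directory named after the key
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.toString + "/" + name
    // keep only the value in the output lines
    override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  }

  val byDate = sc.parallelize(Seq(("2014-08-10", "row1"), ("2014-08-11", "row2")))
  byDate.saveAsHadoopFile("hdfs:///out/by-create-date",
    classOf[String], classOf[String], classOf[KeyBasedOutput])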

ERROR UserGroupInformation: Can't find user in Subject:

2014-08-11 Thread Dan Foisy
Hi, I've installed Spark on a Windows 7 machine. I can get the Spark shell up and running, but when running through the simple example in Getting Started, I get the following error (I tried running as administrator as well) - any ideas? scala> val textFile = sc.textFile("README.md") 14/08/11 08:55:52

RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
I didn’t reply to the last part of your message: my source is Kafka, and Kafka already acts as a buffer with a lot of space. So when I start my Spark job, there is a lot of data to catch up on (and it is critical not to lose any), but the Kafka consumer goes as fast as it can (and it’s faster than my

looking for a definitive RDD.Pipe() example?

2014-08-11 Thread pjv0580
All, I have been searching the web for a few days looking for a definitive Spark/Spark Streaming RDD.pipe() example and cannot find one. Would it be possible to share with the group an example of both the Java/Scala side as well as the script (e.g. Python) side? Any help or response would be
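For what it's worth, a minimal hedged sketch of the Scala side; the script name and contents are assumptions, and any executable that reads lines from stdin and writes lines to stdout will do:

  // Each partition's elements are written to the external process's stdin, one per line;
  // whatever the process prints to stdout becomes the elements of the resulting RDD.
  val nums = sc.parallelize(1 to 10)
  val piped = nums.pipe("./double.py")   // hypothetical script, e.g.: for line in sys.stdin: print(int(line) * 2)
  piped.collect()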

Re: error with pyspark

2014-08-11 Thread Baoqiang Cao
Thanks Davies and Ron! It indeed was due to the ulimit issue. Thanks a lot! Best, Baoqiang Cao Blog: http://baoqiang.org Email: bqcaom...@gmail.com On Aug 11, 2014, at 3:08 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: If you're running on Ubuntu, run ulimit -n, which gives the max number of

Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
Hi, I ran a Spark application in local mode with the command $SPARK_HOME/bin/spark-submit --driver-memory 1g class jar, with master set to local. After around 10 minutes of computing it started to slow down significantly: the next stage took around 50 minutes, and the stage after that was only 80% done after 5 hours, with CPU

Re: Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
I'm using Spark 1.0.0. On Mon, Aug 11, 2014 at 4:14 PM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I ran a Spark application in local mode with the command $SPARK_HOME/bin/spark-submit --driver-memory 1g class jar, with master set to local. After around 10 minutes of computing it

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
I got the same exception after the streaming job had run for a while. The ERROR message was complaining about a temp file not being found in the output folder. 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0 java.io.FileNotFoundException: File

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
The exception was thrown in the application master (the Spark Streaming driver) and the job shut down after this exception. On Mon, Aug 11, 2014 at 10:29 AM, Chen Song chen.song...@gmail.com wrote: I got the same exception after the streaming job had run for a while. The ERROR message was complaining

share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread chutium
Sharing/reusing RDDs is always useful for many use cases; is this possible via persisting an RDD on Tachyon? Such as off-heap persisting a named RDD into a given path (instead of /tmp_spark_tachyon/spark-xxx-xxx-xxx), or saveAsParquetFile on Tachyon. I tried to save a SchemaRDD on Tachyon, val
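As far as I know, a hedged sketch of off-heap persistence in Spark 1.0.x looks like the following; the Tachyon URL and base dir are assumptions, and Spark (not the user) chooses the file names underneath the base dir:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
    .set("spark.tachyonStore.baseDir", "/tmp_spark_tachyon")
  val sc = new SparkContext(conf)

  val rdd = sc.textFile("hdfs:///data/input")
  rdd.persist(StorageLevel.OFF_HEAP)   // blocks stored in Tachyon instead of the JVM heap
  rdd.count()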

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
Bill, did you get this resolved somehow? Does anyone have any insight into this problem? Chen On Mon, Aug 11, 2014 at 10:30 AM, Chen Song chen.song...@gmail.com wrote: The exception was thrown in the application master (the Spark Streaming driver) and the job shut down after this exception. On Mon,

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
I've also been seeing similar stacktraces on Spark core (not streaming) and have a theory it's related to spark.speculation being turned on. Do you have that enabled by chance? On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com wrote: Bill Did you get this resolved somehow?
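For context, a hedged sketch of how speculation is typically enabled in this era of Spark; the threshold values shown are the usual defaults, included only for illustration:

  import org.apache.spark.SparkConf

  // speculative execution re-launches tasks that look slow compared to their peers
  val conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.interval", "100")    // ms between checks for slow tasks
    .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first
    .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow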

Parallelizing a task makes it freeze

2014-08-11 Thread sparkuser2345
I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number of elements in 'dataAll' is 3, but in the real application it would be tens to hundreds.

Re: Can I share the RDD between multiprocess

2014-08-11 Thread coolfrood
Reviving this discussion again... I'm interested in using Spark as the engine for a web service. The SparkContext and its RDDs only exist in the JVM that started it. While RDDs are resilient, this means the context owner isn't resilient, so I may be able to serve requests out of a single

Re: ClassNotFound for user class in uber-jar

2014-08-11 Thread lbustelo
I've seen this same exact problem too and I've been ignoring it, but I wonder if I'm losing data. Can anyone at least comment on this?

ClassNotFound exception on class in uber.jar

2014-08-11 Thread lbustelo
Not sure if this problem reached the Spark guys, because Nabble shows that "This post has NOT been accepted by the mailing list yet." http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-td10613.html#a11902 I'm resubmitting. Greetings, I'm currently

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
Andrew, that is a good finding. Yes, I have speculative execution turned on, because I saw tasks stalled on the HDFS client. If I turn off speculative execution, is there a way to circumvent the hanging task issue? On Mon, Aug 11, 2014 at 11:13 AM, Andrew Ash and...@andrewash.com wrote: I've

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Cheng Lian
Since you were using hql(...), it's probably not related to the JDBC driver. But I failed to reproduce this issue locally with a single-node pseudo-distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks. On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian

Re: Spark SQL JDBC

2014-08-11 Thread Cheng Lian
Hi John, the JDBC Thrift server resides in its own build profile and needs to be enabled explicitly with ./sbt/sbt -Phive-thriftserver assembly. On Tue, Aug 5, 2014 at 4:54 AM, John Omernik j...@omernik.com wrote: I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with the

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Paul Hamilton
Hi Chen, You need to set the max input split size so that the underlying hadoop libraries will calculate the splits appropriately. I have done the following successfully: val job = new Job() FileInputFormat.setMaxInputSplitSize(job, 12800L) And then use job.getConfiguration when creating a
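A hedged sketch of the rest of that pattern, under the assumption of the new Hadoop API; the path and split size here are illustrative:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

  val job = new Job()
  FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L)   // 32 MB splits -> more partitions
  val lines = sc.newAPIHadoopFile("hdfs:///data/input",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    job.getConfiguration)
  lines.partitions.size   // should reflect the smaller split size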

Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher, I also need to have a single function call at the node level. Your suggestion makes sense as a solution to the requirement, but it still feels like a workaround: this check will get called on every row... Also, having static members and methods created specially on a

Random Forest implementation in MLib

2014-08-11 Thread Sameer Tilak
Hi All, I read on the mailing list that a random forest implementation was on the roadmap. I wanted to check on its status. We are currently using Weka and would like to move over to MLlib for performance.

Re: Can I share the RDD between multiprocess

2014-08-11 Thread Ruchir Jha
Look at: https://github.com/ooyala/spark-jobserver On Mon, Aug 11, 2014 at 11:48 AM, coolfrood aara...@quantcast.com wrote: Reviving this discussion again... I'm interested in using Spark as the engine for a web service. The SparkContext and its RDDs only exist in the JVM that started it.

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread SK
Hi, Thanks for the reference to the LBFGS optimizer. I tried to use the LBFGS optimizer, but I am not able to pass it as an input to the LogisticRegression model for binary classification. After studying the code in mllib/classification/LogisticRegression.scala, it appears that the only

mllib style

2014-08-11 Thread Koert Kuipers
I was just looking at ALS (mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala). Is there any need for all the variables to be vars and to have all these setters around? It just leads to so much clutter. If you really want them to be vars, it is safe in Scala to make them public

Failed jobs show up as succeeded in YARN?

2014-08-11 Thread Shay Rojansky
Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0) It seems that Python jobs I'm sending to YARN show up as succeeded even if they failed... Am I doing something wrong, is this a known issue? Thanks, Shay

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread Marcelo Vanzin
Could you share what cluster manager you're using and exactly where the error shows up (driver or executor)? A quick look reveals that Standalone and YARN use different options to control this, for example. (Maybe that in itself should be considered a bug.) On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom

Re: Compile spark code with idea succesful but run SparkPi error with java.lang.SecurityException

2014-08-11 Thread Ron's Yahoo!
Not sure what your environment is, but this happened to me before because I had a couple of servlet-api jars in the path which did not match. I was building a system that programmatically submitted jobs, so I had my own jars that conflicted with those of Spark. The solution is to do mvn

RE: Spark on an HPC setup

2014-08-11 Thread Sidharth Kashyap
Hi Jeremy, Thanks for the reply. We got Spark on our setup after a similar script was brought up to work with LSF. Really appreciate your help. Will keep in touch on Twitter Thanks,@sidkashyap :) From: freeman.jer...@gmail.com Subject: Re: Spark on an HPC setup Date: Thu, 29 May 2014 00:37:54

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread DNoteboom
I'm currently running on my local machine in standalone mode. The error shows up in my code when I am closing resources using TaskContext.addOnCompleteCallback. However, the cause of this error is a faulty classLoader, which must occur in the Executor in the createClassLoader function.

Gathering Information about Standalone Cluster

2014-08-11 Thread Wonha Ryu
Hey all, Is there any kind of API to access the information about resources, executors, and applications in a standalone cluster that is displayed in the web UI? Currently I'm using 1.0.x, but I'm interested in experimenting with the bleeding edge. Thanks, Wonha

Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
We have an open-sourced Random Forest at Alpine Data Labs under the Apache license. We're also trying to have it merged into Spark MLlib now. https://github.com/AlpineNow/alpineml It has been tested a lot, and the accuracy and training-time benchmarks are great. There could be some bugs here and

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread Burak Yavuz
Hi, // Initialize the optimizer using logistic regression as the loss function with L2 regularization val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) // Set the hyperparameters

Re: Job ACL's on SPark

2014-08-11 Thread Manoj kumar
Hi Friends, Any response on this? I looked into the documentation but could not find any information. --Manoj On Fri, Aug 8, 2014 at 6:56 AM, Manoj kumar manojkumarr2...@gmail.com wrote: Hi Team, Do we have Job ACLs for Spark similar to Hadoop Job ACLs, where I can restrict who

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Yin Huai
Hi Jenny, How's your metastore configured for both Hive and Spark SQL? Which metastore mode are you using (based on https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin )? Thanks, Yin On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao linlin200...@gmail.com wrote: you can

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread Haoyuan Li
Is speculative execution enabled? Best, Haoyuan On Mon, Aug 11, 2014 at 8:08 AM, chutium teng@gmail.com wrote: Sharing/reusing RDDs is always useful for many use cases; is this possible via persisting an RDD on Tachyon? Such as off-heap persisting a named RDD into a given path (instead

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Jenny Zhao
Thanks Yin! Here is my hive-site.xml, which I copied from $HIVE_HOME/conf; I didn't experience any problem connecting to the metastore through Hive, which uses DB2 as the metastore database. <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Licensed to the Apache Software

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Tobias Pfeiffer
Hi, On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers gwenhael.pasqui...@ericsson.com wrote: We intend to apply other operations on the data later in the same Spark context, but our first step is to archive it. Our goal is something like this: Step 1: consume Kafka. Step 2: archive to

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Xuri Nagarin
In general (and I am prototyping), I have a better idea :) - Consume Kafka in Spark from topic-A - Transform the data in Spark (normalize, enrich, etc.) - Feed it back to Kafka (into a different topic-B) - Have Flume-HDFS (for M/R, Impala, Spark batch) or Spark Streaming or any other compute

Using very large files for KMeans training -- cluster centers size?

2014-08-11 Thread durin
I'm trying to apply KMeans training to some text data, which consists of lines that each contain something between 3 and 20 words. For that purpose, all unique words are saved in a dictionary. This dictionary can become very large as no hashing etc. is done, but it should spill to disk in case it

Re: java.lang.StackOverflowError when calling count()

2014-08-11 Thread randylu
Hi TD, I also fell into the trap of long lineage, and your suggestions do work well. But I don't understand why the long lineage can cause a stack overflow, and where it takes effect.

KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Ge, Yao (Y.)
I am trying to train a KMeans model with sparse vectors with Spark 1.0.1. When I run the training I get the following exception: java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at
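One frequent cause of this particular requirement failure is sparse vectors declared with inconsistent sizes; a hedged sketch, where numFeatures and the input pairs are hypothetical:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val numFeatures = 1000   // every vector must declare this same size
  val data = sc.parallelize(Seq(
    (Array(1, 7, 42), Array(1.0, 2.0, 1.0)),
    (Array(3, 100),   Array(5.0, 1.0))
  )).map { case (indices, values) => Vectors.sparse(numFeatures, indices, values) }.cache()

  val model = KMeans.train(data, 2, 20)   // k = 2, maxIterations = 20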

Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
Hi all, We are trying to use Spark MLlib to train on super large data (100M features and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib will create a task for every partition at each iteration. But because our dimensionality is also very high, such a large number of tasks will
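If the goal is simply fewer tasks per iteration, one hedged option is to repartition the training RDD before handing it to MLlib; trainingData, the target partition count, and the iteration count below are illustrative assumptions:

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

  // trainingData is assumed to be an existing RDD[LabeledPoint] with ~26K partitions;
  // shuffle = false keeps coalesce a narrow dependency, at the cost of larger partitions
  val coalesced = trainingData.coalesce(2000, shuffle = false).cache()
  val model = LogisticRegressionWithSGD.train(coalesced, 100)   // 100 iterations, illustrative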

Transform RDD[List]

2014-08-11 Thread Kevin Jung
Hi, It may be a simple question, but I cannot figure out the most efficient way. There is an RDD containing lists: RDD ( List(1,2,3,4,5) List(6,7,8,9,10) ) I want to transform this to RDD ( List(1,6) List(2,7) List(3,8) List(4,9) List(5,10) ), and I want to achieve this without using collect

Re: Transform RDD[List]

2014-08-11 Thread Soumya Simanta
Try something like this. scala> val a = sc.parallelize(List(1,2,3,4,5)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> val b = sc.parallelize(List(6,7,8,9,10)) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread Jiusheng Chen
How about increasing the HDFS file extent size? E.g., the current value is 128M; we could make it 512M or bigger. On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong dong...@gmail.com wrote: Hi all, We are trying to use Spark MLlib to train on super large data (100M features and 5B rows). The input data in HDFS

Support for ORC Table in Shark/Spark

2014-08-11 Thread vinay . kashyap
Hi all, Is it possible to use a table in ORC format in Shark version 0.9.1 with Spark 0.9.2 and Hive version 0.12.0? I have tried creating the ORC table in Shark using the query below: create table orc_table (x int, y string) stored as orc. The create table works, but when I try to insert values

How to save mllib model to hdfs and reload it

2014-08-11 Thread XiaoQinyu
Hello: I want to know, if I use historical data to train a model and want to use this model in another app, how should I do it? Should I save the model to disk, and then load it from disk when I use it? But I don't know how to save the MLlib model and reload it. I will be very pleased if
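A hedged workaround sketch, since MLlib of this era has no built-in model save/load: serialize the trained model object itself as a one-element RDD and read it back with objectFile. The model type, path, and feature vector below are assumptions.

  import org.apache.spark.mllib.classification.LogisticRegressionModel

  // `model` is assumed to be an already-trained LogisticRegressionModel
  sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///models/my-model")

  // in the other app: the type parameter must match what was saved
  val reloaded = sc.objectFile[LogisticRegressionModel]("hdfs:///models/my-model").first()
  reloaded.predict(someFeatureVector)   // someFeatureVector is a hypothetical mllib Vector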

Re: Mllib : Save SVM model to disk

2014-08-11 Thread XiaoQinyu
Have you solved this problem? And could you share how to save the model to HDFS and reload it? Thanks, XiaoQinyu

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
I think this has the same effect and issue as #1, right? On Tue, Aug 12, 2014 at 1:08 PM, Jiusheng Chen chenjiush...@gmail.com wrote: How about increasing the HDFS file extent size? E.g., the current value is 128M; we could make it 512M or bigger. On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
Not sure which stalled HDFS client issue you're referring to, but there was one fixed in Spark 1.0.2 that could help you out -- https://github.com/apache/spark/pull/1409. I've still seen one related to Configuration objects not being thread-safe though, so you'd still need to keep speculation on to

Re: Transform RDD[List]

2014-08-11 Thread Kevin Jung
Hi ssimanta. The first line creates an RDD[Int], not an RDD[List[Int]]. In the case of lists, I cannot zip all list elements in the RDD like a.zip(b), and I cannot use only Tuple2 because the real-world RDD has more List elements in the source RDD. So I guess the expected result depends on the count of the original lists.
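A hedged sketch of one way to do this transpose without collecting the transformation itself, assuming every list has the same length (groupByKey can be costly for very long lists):

  val rdd = sc.parallelize(Seq(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10)))
  val transposed = rdd
    .zipWithIndex()                                             // (list, listId)
    .flatMap { case (xs, listId) =>
      xs.zipWithIndex.map { case (v, pos) => (pos, (listId, v)) }
    }
    .groupByKey()
    .sortByKey()
    .map { case (_, entries) => entries.toSeq.sortBy(_._1).map(_._2).toList }
  transposed.collect()   // only to inspect: List(1,6), List(2,7), List(3,8), List(4,9), List(5,10)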