Unable to use "Batch Start Time" on worker nodes.

2015-11-26 Thread Abhishek Anand
Hi, I need to use the batch start time in my Spark Streaming job. I need the value of the batch start time inside one of the functions that is called within a flatMap function in Java. Please suggest how this can be done. I tried to use the StreamingListener class and set the value of a variable
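One way to get at the batch time without a listener is the transform variant that exposes the batch Time. A minimal Scala sketch, assuming an input stream named lines (the Java equivalent takes a Function2 over the RDD and the Time):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    // Tag every record with the start time of the batch it belongs to.
    def withBatchTime(lines: DStream[String]): DStream[(Long, String)] =
      lines.transform { (rdd: RDD[String], batchTime: Time) =>
        val start = batchTime.milliseconds          // batch start time in ms
        rdd.flatMap(_.split(" ").map(w => (start, w)))
      }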

Grid search with Random Forest

2015-11-26 Thread Ndjido Ardo Bar
Hi folks, does anyone know whether the Grid Search capability is enabled since issue SPARK-9011 in version 1.4.0? I'm getting a "rawPredictionCol column doesn't exist" error when trying to perform a grid search with Spark 1.4.0. Cheers, Ardo
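For reference, grid search over a random forest with CrossValidator looks like the sketch below on later releases (1.5+, where RandomForestClassifier emits the rawPrediction column that BinaryClassificationEvaluator reads); the grid values and the "label"/"features" column conventions are assumptions:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val rf = new RandomForestClassifier()      // expects "label" and "features"
    val grid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .addGrid(rf.maxDepth, Array(5, 10))
      .build()
    val cv = new CrossValidator()
      .setEstimator(rf)
      .setEvaluator(new BinaryClassificationEvaluator()) // reads "rawPrediction"
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    // val model = cv.fit(trainingData)        // trainingData: your DataFrame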

Re: Help with Couchbase connector error

2015-11-26 Thread Shixiong Zhu
Het Eyal, I just checked the couchbase spark connector jar. The target version of some of classes are Java 8 (52.0). You can create a ticket in https://issues.couchbase.com/projects/SPARKC Best Regards, Shixiong Zhu 2015-11-26 9:03 GMT-08:00 Ted Yu : > StoreMode is from

Re: Stop Spark yarn-client job

2015-11-26 Thread Jeff Zhang
Could you attach the YARN AM log? On Fri, Nov 27, 2015 at 8:10 AM, Jagat Singh wrote: > Hi, > > What is the correct way to fully stop a Spark job which is running as > yarn-client using spark-submit. > > We are using sc.stop in the code and can see the job still running

Stop Spark yarn-client job

2015-11-26 Thread Jagat Singh
Hi, what is the correct way to fully stop a Spark job which is running as yarn-client using spark-submit? We are using sc.stop in the code and can see the job still running (in the YARN ResourceManager) after the final Hive insert is complete. The code flow is: start context, do some work, insert to
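A minimal sketch of the usual shutdown pattern (doSomeWork is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hive-insert-job"))
    try {
      doSomeWork(sc)   // hypothetical: transformations plus the final Hive insert
    } finally {
      sc.stop()        // releases executors and unregisters from YARN
      // If non-daemon threads keep the JVM alive, the YARN app can linger;
      // some jobs follow sc.stop() with System.exit(0) as a last resort.
    }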

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-26 Thread Gylfi
HDFS has a default replication factor of 3, so a 3.8 TB dataset occupies roughly 3.8 TB x 3 = 11.4 TB of raw storage, which matches the ~11.59 TB you are seeing.

Millions of entities in custom Hadoop InputFormat and broadcast variable

2015-11-26 Thread Anfernee Xu
Hi Spark experts, first of all, happy Thanksgiving! Now to my question: I have implemented a custom Hadoop InputFormat to load millions of entities from my data source into Spark (as a JavaRDD, then transformed to a DataFrame). The approach I took in implementing the custom Hadoop RDD is loading all

Re: GraphX - How to make a directed graph an undirected graph?

2015-11-26 Thread Robineast
1. GraphX doesn't have a concept of undirected graphs; edges are always specified with a srcId and a dstId. However, there is nothing to stop you adding edges that point in the other direction, i.e. if you have an edge srcId -> dstId you can add an edge dstId -> srcId. 2. In general APIs will
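A minimal sketch of point 1, building an "undirected" view by unioning each edge with its reverse:

    import scala.reflect.ClassTag
    import org.apache.spark.graphx.{Edge, Graph}

    // Simulate an undirected graph by adding the reverse of every edge.
    def undirect[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[VD, ED] = {
      val reversed = g.edges.map(e => Edge(e.dstId, e.srcId, e.attr))
      Graph(g.vertices, g.edges.union(reversed))
    }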

error while creating HiveContext

2015-11-26 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am building a Spark SQL application in Java. I created a Maven project in Eclipse and added all dependencies, including spark-core and spark-sql. I am creating a HiveContext in my Spark program and then trying to run SQL queries against my Hive table. When I submit this job to Spark, for some

Optimizing large collect operations

2015-11-26 Thread Gylfi
Hi, I am doing very large collectAsMap() operations, on about 10,000,000 records, and I am getting "org.apache.spark.SparkException: Error communicating with MapOutputTracker" errors. Details: "org.apache.spark.SparkException: Error communicating with MapOutputTracker at

controlling parquet file sizes for faster transfer to S3 from HDFS

2015-11-26 Thread AlexG
Is there a way to control how large the part- files are for a Parquet dataset? I'm currently using e.g. results.toDF.coalesce(60).write.mode("append").parquet(outputdir) to manually reduce the number of parts, but this doesn't map linearly to fewer parts: I noticed that coalescing to 30 actually
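Worth noting: coalesce only merges existing partitions and can be pushed upstream, which is one reason the part count doesn't track the argument. A hedged alternative is a full shuffle with repartition, which yields exactly the requested number of part- files at the cost of the shuffle:

    // repartition forces a shuffle, so exactly 30 part- files come out;
    // 30 is the poster's example target, not a recommendation.
    results.toDF
      .repartition(30)
      .write.mode("append")
      .parquet(outputdir)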

starting start-master.sh throws "java.lang.ClassNotFoundException: org.slf4j.Logger" error

2015-11-26 Thread Mich Talebzadeh
Hi, I just built Spark without the Hive jars and am trying to run start-master.sh. I get this error in the log; it sounds like it cannot find org.slf4j.Logger: java.lang.ClassNotFoundException: org.slf4j.Logger Spark Command: /usr/java/latest/bin/java -cp

Re: RE: Spark checkpoint problem

2015-11-26 Thread eric wong
I don't think it is a deliberate design. You may need to run an action on the RDD after marking it for checkpointing, if you want the RDD to actually be checkpointed. 2015-11-26 13:23 GMT+08:00 wyphao.2007 : > Spark 1.5.2. > > On 2015-11-26 13:19:39, "张志强(旺轩)" wrote: >
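In other words, checkpoint() only marks the RDD; the data is written when a later job runs over it. A minimal sketch (paths hypothetical):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical path
    val rdd = sc.textFile("hdfs:///data/input").map(_.length)
    rdd.cache()        // avoids recomputing the lineage while checkpointing
    rdd.checkpoint()   // only marks the RDD for checkpointing
    rdd.count()        // this action actually materializes the checkpoint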

ClassNotFoundException with a uber jar.

2015-11-26 Thread Marc de Palol
Hi all, I have an uber jar made with Maven; the contents are: my.org.my.classes.Class ... lib/lib1.jar // 3rd party libs lib/lib2.jar I'm using this kind of jar for Hadoop applications and everything works fine. I added the Spark libs, Scala and everything needed for Spark, but when I submit this jar to

Re: ClassNotFoundException with a uber jar.

2015-11-26 Thread Ali Tajeldin EDU
I'm not 100% sure, but I don't think a jar within a jar will work without a custom class loader. You can try "maven-assembly-plugin" or "maven-shade-plugin" to build your uber/fat jar; both of these build a flattened single jar. -- Ali On Nov 26, 2015, at 2:49 AM, Marc de
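A minimal maven-shade-plugin fragment for the pom, as a sketch (the version and signature-file filter are common defaults, not verified against this particular build):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <filters>
              <filter>
                <!-- strip stale signature files from shaded dependencies -->
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </execution>
      </executions>
    </plugin>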

java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Sahil Sareen
I'm using Spark 1.4.2 with Hadoop 2.7. I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help. Any ideas on what could be causing this? This is the exception that I am getting: [MySparkApplication] WARN : Failed to execute SQL statement select * from TableS s join TableC c on

[no subject]

2015-11-26 Thread Dmitry Tolpeko

custom inputformat recordreader

2015-11-26 Thread Patcharee Thongtra
Hi, how do I use a custom InputFormat/RecordReader in Python? Thanks, Patcharee

MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread luohui20001
Hi guys, when I am trying to connect Hive with Spark SQL, I get a problem like below: [root@master spark]# bin/spark-shell --master local[4] log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please initialize the log4j system

Re: custom inputformat recordreader

2015-11-26 Thread Ted Yu
Please take a look at python/pyspark/tests.py. There are examples using sc.hadoopFile() and sc.newAPIHadoopRDD(). Cheers On Thu, Nov 26, 2015 at 4:50 AM, Patcharee Thongtra < patcharee.thong...@uni.no> wrote: > Hi, > > In python how to use inputformat/custom recordreader? > > Thanks, >

Re: java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Ted Yu
bq. (Permission denied) Have you checked the permission for /mnt/md0/var/lib/spark/... ? Cheers On Thu, Nov 26, 2015 at 3:03 AM, Sahil Sareen wrote: > Im using Spark1.4.2 with Hadoop 2.7, I tried increasing > spark.shuffle.io.maxRetries to 10 but didn't help. > > Any

Help with Couchbase connector error

2015-11-26 Thread Eyal Sharon
Hi, I am trying to set up a connection to Couchbase. I am at the very beginning, and I got stuck on this exception: Exception in thread "main" java.lang.UnsupportedClassVersionError: com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0 Here is the simple code fragment val sc =

Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
This implies a version mismatch between the JDK used to build your jar and the one at runtime (class file version 52.0 corresponds to Java 8, 51.0 to Java 7). When building, target JDK 1.7. There are plenty of posts on the web about dealing with this error. Cheers On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon wrote: > Hi, > > I am trying to

Re: MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/q3RTtCoKmv14Hd1H1=Re+Spark+Hive+max+key+length+is+767+bytes On Thu, Nov 26, 2015 at 5:26 AM, wrote: > hi guys, > > when I am trying to connect hive with spark-sql, I got a problem like > below: > > > [root@master
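For the archives: this is usually the Hive metastore's "max key length is 767 bytes" problem on MySQL, and the workaround most often cited is switching the metastore database to a single-byte character set; a hedged sketch (the database name is hypothetical):

    -- run against the MySQL database backing the Hive metastore
    ALTER DATABASE metastore CHARACTER SET latin1;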

Re: Help with Couchbase connector error

2015-11-26 Thread Eyal Sharon
Hi, great, that gave me some direction. But can you elaborate more, or share a post? I am currently running JDK 7, and so is my Couchbase. Thanks! On Thu, Nov 26, 2015 at 6:02 PM, Ted Yu wrote: > This implies a version mismatch between the JDK used to build your jar and >

Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
StoreMode is from the Couchbase connector. Where did you obtain the connector? See also http://stackoverflow.com/questions/1096148/how-to-check-the-jdk-version-used-to-compile-a-class-file On Thu, Nov 26, 2015 at 8:55 AM, Eyal Sharon wrote: > Hi , > Great , that gave some
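From that link, one way to check is the class file's major version (52 = Java 8, 51 = Java 7); a sketch, with the connector jar name hypothetical:

    # extract the class from the connector jar, then inspect its version
    unzip -p couchbase-spark-connector.jar com/couchbase/spark/StoreMode.class > StoreMode.class
    javap -verbose StoreMode.class | grep "major version"   # 52 = Java 8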

question about combining small parquet files

2015-11-26 Thread Nezih Yigitbasi
Hi Spark people, I have a Hive table that has a lot of small Parquet files, and I am creating a data frame out of it to do some processing, but since I have a large number of splits/files my job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive
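Until something like Hive's merge behavior exists, one hedged workaround is a one-off compaction pass with a controlled partition count (the paths and the target count of 64 are hypothetical):

    val df = sqlContext.read.parquet("/warehouse/my_table")  // many small files
    df.coalesce(64)                 // cap the number of partitions/output files
      .write.mode("overwrite")
      .parquet("/warehouse/my_table_compacted")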

Re: question about combining small parquet files

2015-11-26 Thread Ruslan Dautkhanov
An interesting compaction approach for small files was discussed recently: http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ AFAIK Spark supports views too. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <

possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-26 Thread Andy Davidson
I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use Python 3. I get an exception: 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED
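A workaround often suggested for this is pinning the hash seed on every node, e.g. in conf/spark-env.sh (the value 0 is arbitrary; any fixed value shared by driver and executors should do):

    # spark-env.sh on all nodes, or pass it through the executor environment:
    export PYTHONHASHSEED=0
    # equivalently: spark-submit --conf spark.executorEnv.PYTHONHASHSEED=0 ...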

Re: UDF with 2 arguments

2015-11-26 Thread Daniel Lopes
Thanks Davies and Nathan, I found my error. I was using *ArrayType()* without specifying the element type it contains; I should have been passing *ArrayType(IntegerType())*. Thanks :) On Wed, Nov 25, 2015 at 7:46 PM, Davies Liu wrote: > It works in master (1.6), what's
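For reference, the Scala analogue of a two-argument UDF returning an array (names hypothetical):

    import org.apache.spark.sql.functions.udf

    // Two-argument UDF returning an array of ints; Spark infers the return
    // type as ArrayType(IntegerType) -- the element type is always required
    // whenever you spell the schema out yourself.
    val rangeUdf = udf((start: Int, end: Int) => (start until end).toArray)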

Re: error while creating HiveContext

2015-11-26 Thread fightf...@163.com
Hi, I think you just need to put hive-site.xml in the spark/conf directory and it will be loaded onto the Spark classpath. Best, Sun. fightf...@163.com From: Chandra Mohan, Ananda Vel Murugan Date: 2015-11-27 15:04 To: user Subject: error while creating HiveContext Hi, I am building a

Spark on yarn vs spark standalone

2015-11-26 Thread cs user
Hi All, apologies if this question has been asked before. I'd like to know if there are any downsides to running Spark over YARN with the --master yarn-cluster option vs having a separate Spark standalone cluster to execute jobs. We're looking at installing an HDFS/Hadoop cluster with Ambari and

Re: Spark on yarn vs spark standalone

2015-11-26 Thread Jeff Zhang
If your cluster is a dedicated Spark cluster (only running Spark jobs, no other workloads like Hive/Pig/MR), then Spark standalone would be fine. Otherwise I think YARN would be a better option. On Fri, Nov 27, 2015 at 3:36 PM, cs user wrote: > Hi All, > > Apologies if this

Re: Optimizing large collect operations

2015-11-26 Thread Jeff Zhang
For such large output, I would suggest you do this processing in the cluster rather than in the driver (use the RDD API for that). If you really want to pull it to the driver, you can first save it to HDFS and then read it with the HDFS API, to avoid the Akka issue. On Fri, Nov 27, 2015 at 2:41 PM,
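A hedged sketch of that pattern (the RDD name pair and the path are hypothetical):

    // Instead of pair.collectAsMap() over ~10M records, write from the
    // executors and stream the result back through the HDFS client:
    pair.map { case (k, v) => s"$k\t$v" }
        .saveAsTextFile("hdfs:///tmp/large-result")
    // then read hdfs:///tmp/large-result on the driver line by line,
    // instead of shipping everything through the task-result path at once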