Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a partition key while SPARK-11153 is about Parquet filter push-down and doesn't affect partition pruning. Cheng On 11/3/15 7:14 PM, Rex Xiong wrote: We found the query performance is very poor due to this issue

Re: How to handle Option[Int] in dataframe

2015-11-03 Thread Michael Armbrust
In Spark 1.6 there is an experimental new feature called Datasets. You can call df.as[Student] and it should do what you want. Would love any feedback you have if you get a chance to try it out (we'll hopefully publish a preview release next week). On Mon, Nov 2, 2015 at 9:30 PM, manas kar
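A minimal sketch of what that looks like, assuming a hypothetical Student case class and JSON source (Spark 1.6 preview API):

    case class Student(name: String, score: Option[Int])

    import sqlContext.implicits._                   // brings in the Encoder for Student

    val df = sqlContext.read.json("students.json")  // hypothetical source
    val ds = df.as[Student]                         // typed Dataset[Student]
    ds.filter(_.score.exists(_ > 50)).show()        // Option[Int] handled natively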

Re: Does the Standalone cluster and Applications need to be same Spark version?

2015-11-03 Thread Saisai Shao
I think it should work unless you use some new APIs that only exist in the 1.5.1 release (mostly this will not happen). You'd better give it a try to see whether it runs or not. On Tue, Nov 3, 2015 at 10:11 AM, pnpritchard < nicholas.pritch...@falkonry.com> wrote: > The title gives the gist of

Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Ted Yu
I am a bit curious: why is the synchronization on finalLock needed? Thanks > On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal wrote: > > I have a spark job that creates 6 million rows in RDDs. I convert the RDD > into DataFrame and write it to HDFS. Currently it takes 3

How to enable debug in Spark Streaming?

2015-11-03 Thread diplomatic Guru
I have an issue with a Spark Streaming job that appears to be running but not producing any results. Therefore, I would like to enable debug mode to get as much logging as possible.
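For reference, a minimal sketch of two common ways to raise the log verbosity (the log4j snippet assumes the default log4j setup shipped with Spark):

    sc.setLogLevel("DEBUG")   // programmatic, available from Spark 1.4 onward

    // or edit conf/log4j.properties on each node, e.g.:
    //   log4j.rootCategory=DEBUG, console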

Re: Prevent partitions from moving

2015-11-03 Thread Akhil Das
Most likely in your case the partition keys are not evenly distributed, and hence you may notice some of your tasks taking far longer to process. You will have to use a custom partitioner
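A minimal sketch of a custom Partitioner, assuming (hypothetically) that the hot keys have been pre-salted as (key, salt) pairs upstream so one hot key spreads over several partitions:

    import org.apache.spark.Partitioner

    class SaltedKeyPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0, "need at least one partition")
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        // deterministic non-negative modulo, so the same key always
        // lands on the same partition
        val mod = key.hashCode % partitions
        if (mod < 0) mod + partitions else mod
      }
    }

    // hypothetical usage: salt the hot key before repartitioning
    // rdd.map { case (k, v) => ((k, v.hashCode % 4), v) }
    //    .partitionBy(new SaltedKeyPartitioner(16))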

Re: Apache Spark on Raspberry Pi Cluster with Docker

2015-11-03 Thread Akhil Das
Can you try it with just: spark-submit --master spark://master:6066 --class SimpleApp target/simple-project-1.0.jar and see if it works? An even better idea would be to spawn a spark-shell (*MASTER=spark://master:6066 bin/spark-shell*) and try out a simple *sc.parallelize(1 to 1000).collect*

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Rex Xiong
We found the query performance is very poor due to this issue https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11153 We usually filter on the partition key, a date stored as string type; in 1.3.1 this works great. But in 1.5, it needs to do a Parquet scan for all partitions.

Re: SparkSQL implicit conversion on insert

2015-11-03 Thread Michael Armbrust
Today you have to do an explicit conversion. I'd really like to open up a public UDT interface as part of Spark Datasets (SPARK-) that would allow you to register custom classes with conversions, but this likely won't happen until Spark 1.7. On Mon, Nov 2, 2015 at 8:40 PM, Bryan Jeffrey
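A minimal sketch of the explicit conversion, assuming a hypothetical custom field type (a Joda DateTime) mapped to a SQL-supported type before building the DataFrame:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    case class MyEvent(id: Long, when: DateTime)    // custom type, not SQL-supported
    case class SqlEvent(id: Long, when: Timestamp)  // SQL-friendly mirror

    val events = sc.parallelize(Seq(MyEvent(1L, DateTime.now)))
    val df = sqlContext.createDataFrame(
      events.map(e => SqlEvent(e.id, new Timestamp(e.when.getMillis))))
    df.write.insertInto("events_table")             // hypothetical target table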

Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Anubhav Agarwal
I was getting the following error without it: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /.gz.parquet (inode ): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_, pendingcreates: 1] I think that is due to a deadlock.

collect() local faster than 4 node cluster

2015-11-03 Thread Sebastian Kuepers
Hey, with collect(), RDD elements are sent as a list back to the driver. I have a 4-node cluster (based on Mesos) in a datacenter, plus my local dev machine. I work with a small 200MB dataset just for testing during development right now. The collect() tasks are running for times

Re: Vague Spark SQL error message with saveAsParquetFile

2015-11-03 Thread Zhan Zhang
Looks like some JVM got killed or hit an OOM. You can check the logs to see the real cause. Thanks. Zhan Zhang On Nov 3, 2015, at 9:23 AM, YaoPau > wrote: java.io.FileNotFoun

Vague Spark SQL error message with saveAsParquetFile

2015-11-03 Thread YaoPau
I'm using Spark SQL to query one partition at a time of a Hive external table that sits atop gzip data, and then I'm saving that partition to a new HDFS location as a set of Parquet Snappy files using .saveAsParquetFile(). The query completes successfully, but then I get a vague error message I

Where does mllib's .save method save a model to?

2015-11-03 Thread xenocyon
I want to save an mllib model to disk, and am trying the model.save operation as described in http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples: model.save(sc, "myModelPath") But after running it, I am unable to find any newly created file or dir by the name
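Making the filesystem explicit in the path usually resolves this, since on a cluster a bare "myModelPath" resolves against the default Hadoop filesystem (often HDFS) rather than the local disk. A minimal sketch, assuming an ALS MatrixFactorizationModel as in the linked example:

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    model.save(sc, "hdfs:///user/me/myModelPath")   // hypothetical HDFS path
    // or, for the driver's local filesystem:
    // model.save(sc, "file:///tmp/myModelPath")

    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///user/me/myModelPath")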

Re: Standalone cluster not using multiple workers for single application

2015-11-03 Thread Jeff Jones
With the default configuration SparkTC won’t run on my cluster. The log has: 15/11/03 17:50:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources With the SparkUI Completed Applications

bin/pyspark SparkContext is missing?

2015-11-03 Thread Andy Davidson
I am having a heck of a time getting IPython notebooks to work on my 1.5.1 AWS cluster I created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 I have read the instructions for using IPython notebook on http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell I want to run the

Re: Spark dynamic allocation config

2015-11-03 Thread Marcelo Vanzin
Hi, your question is really CM-related and not Spark-related, so I'm bcc'ing the list and will reply separately. On Tue, Nov 3, 2015 at 11:08 AM, billou2k wrote: > Hi, > Sorry this is probably a silly question but > I have a standard CDH 5.4.2 config with Spark 1.3 and

Frozen exception while dynamically creating classes inside Spark using JavaAssist API

2015-11-03 Thread Rachana Srivastava
I am trying to dynamically create a new class in Spark using the Javassist API. The code seems very simple, just invoking the makeClass API on a hardcoded class name. The code works fine outside the Spark environment, but I am getting this checkNotFrozen exception when I am running the code inside Spark

kerberos question

2015-11-03 Thread Chen Song
We saw the following error happening in a Spark Streaming job. Our job is running on YARN with Kerberos enabled. First, the warnings below were printed out; I only pasted a few, but they were repeated hundreds/thousands of times. 15/11/03 14:43:07 WARN UserGroupInformation:

Re: Very slow performance on very small record counts

2015-11-03 Thread Cody Koeninger
I had put in a patch to improve the performance of count(), take(), and isEmpty() on KafkaRDD that should be in spark 1.5.1... My bet is because you were doing the isEmpty after the map, it was using the implementations on MapPartitionsRDD, not KafkaRDD. If things are working now you may not
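A minimal sketch of the reordering Cody describes, assuming a hypothetical directStream created with KafkaUtils.createDirectStream that yields (key, value) pairs:

    directStream.foreachRDD { rdd =>
      // isEmpty on the KafkaRDD itself is a cheap offset comparison in 1.5.1+;
      // rdd.map(...).isEmpty would fall back to the generic implementation
      if (!rdd.isEmpty()) {
        rdd.map(_._2)   // transform only after the emptiness check
           .saveAsTextFile(s"/out/batch-${System.currentTimeMillis}")  // hypothetical sink
      }
    }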

Re: apply simplex method to fix linear programming in spark

2015-11-03 Thread Debasish Das
Spark has NNLS in mllib's optimization package. I have refactored NNLS into Breeze as well, but we could not move NNLS out of mllib due to some runtime issues with Breeze. The issue with the Spark or Breeze NNLS is that it takes a dense Gram matrix, which does not scale if the rank is high, but it has been working fine for

best practices machine learning with python 2 or 3?

2015-11-03 Thread Andy Davidson
I am fairly new to Python and am starting a new project that will want to make use of Spark and the Python machine learning libraries (matplotlib, pandas, etc.). I noticed that the spark-ec2 script set up my AWS cluster with Python 2.6 and 2.7

Limit the size of /tmp/[...].inprogress files in Spark Streaming

2015-11-03 Thread Mathieu Garstecki
Hello, I'm having trouble with the "inprogress" files generated per application in /tmp with Spark Streaming. They seem to grow continually and never shrink, and end up filling the /tmp partition. I haven't found much literature on those files. I've tried to set spark.cleaner.ttl to a

RE: Very slow performance on very small record counts

2015-11-03 Thread Young, Matthew T
+user to potentially help others Cody, Thanks for calling out isEmpty, I didn’t realize that it was so dangerous. Taking that out and just reusing the count has eliminated the issue, and now the cluster is happily eating 400,000 record batches. For completeness’ sake: I am using the direct

Spark dynamic allocation config

2015-11-03 Thread billou2k
Hi, Sorry this is probably a silly question but I have a standard CDH 5.4.2 config with Spark 1.3 and I'm trying to setup Spark dynamic allocation which was introduced in CDH 5.4.x and Spark 1.2. According to the doc
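For reference, a minimal sketch of the settings involved (the values are examples only; dynamic allocation also needs the external shuffle service running on each node):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")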

Re: ClassNotFoundException even if class is present in Jarfile

2015-11-03 Thread hveiga
It turned out to be a problem with `SerializationUtils` from Apache Commons Lang. There is an open issue where the class will throw a `ClassNotFoundException` even if the class is in the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049 We moved away

Why some executors are lazy?

2015-11-03 Thread Khaled Ammar
Hi, I'm using the most recent Spark version on a standalone setup of 16+1 machines. While running GraphX workloads, I found that some executors are lazy: they *rarely* participate in computation. This causes other executors to do their work. This behavior is consistent in all iterations and

Support Ordering on UserDefinedType

2015-11-03 Thread Ionized
TypeUtils.getInterpretedOrdering currently only supports AtomicType and StructType. Is it possible to add support for UserDefinedType as well? - Paul

error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
Hi all, I am saving some Hive query results into the local directory: val hdfsFilePath = "hdfs://master:ip/tempFile"; val localFilePath = "file:///home/hduser/tempFile"; hiveContext.sql(s"""my hql codes here""") res.printSchema() --working res.show() --working res.map{ x => tranRow2Str(x)
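A minimal sketch of how the truncated snippet presumably continues (tranRow2Str is the poster's own helper; the coalesce is an assumption to get a single output file). Note that with a file:/// path each task writes on whichever node it runs, not on the driver:

    val res = hiveContext.sql("""my hql codes here""")
    res.map(x => tranRow2Str(x))
       .coalesce(1)                          // assumed: one output part-file
       .saveAsTextFile("file:///home/hduser/tempFile")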

Re: kinesis batches hang after YARN automatic driver restart

2015-11-03 Thread Hster Geguri
Hello Tathagata, Thank you for responding. I have read your excellent article on Zero Data Loss many many times. The Spark Streaming screen shows KCL consistently pulling events from the stream after half a minute as per usual which gets queued up. It's always the first two batches (0 events

Spark Streaming saveAsTextFiles to Amazon S3

2015-11-03 Thread Yuan Zhang
Hi all, I am running a Spark streaming job on Amazon EMR. I am trying to save DStream to S3. How can I set up S3 access ID/key to save files to S3 under a different account from the EMR account? Thanks! Best Regards, Nick

Fwd: Where does mllib's .save method save a model to?

2015-11-03 Thread Simon Hafner
2015-11-03 20:26 GMT+01:00 xenocyon : > I want to save an mllib model to disk, and am trying the model.save > operation as described in > http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples: > > model.save(sc, "myModelPath") > > But after running

Fwd: collect() local faster than 4 node cluster

2015-11-03 Thread Simon Hafner
2015-11-03 20:07 GMT+01:00 Sebastian Kuepers : > Hey, > > with collect() RDDs elements are send as a list back to the driver. > > If have a 4 node cluster (based on Mesos) in a datacenter and I have my > local dev machine. > > I work with a small 200MB

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

New Apache Spark Meetup NRW, Germany

2015-11-03 Thread pchundi
Hi, After attending the Spark Summit Europe 2015, I have started a Spark meetup group for the German state of Nordrhein-Westfalen. It would be great if you could add it to the list of meetups on the Apache Spark page. http://www.meetup.com/spark-users-NRW/

Re: error with saveAsTextFile in local directory

2015-11-03 Thread Ted Yu
Looks like you were running 1.4.x or earlier release because the allowLocal flag is deprecated as of Spark 1.5.0+. Cheers On Tue, Nov 3, 2015 at 3:07 PM, Jack Yang wrote: > Hi all, > > > > I am saving some hive- query results into the local directory: > > > > val hdfsFilePath

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build interactive data analysis platform because the interactive session could be long and might not use Spark for the entire session. It is very wasteful of resources if we used the coarse-grained mode because it keeps resource for the entire session. Therefore,

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Soren Macbeth
we use fine-grained mode. coarse-grained mode keeps JVMs around which often leads to OOMs, which in turn kill the entire executor, causing entire stages to be retried. In fine-grained mode, only the task fails and subsequently gets retried without taking out an entire stage or worse. On Tue, Nov

Upgrade spark cluster to latest version

2015-11-03 Thread roni
Hi Spark experts, This may be a very naive question, but can you please point me to a proper way to upgrade the Spark version on an existing cluster. Thanks Roni > Hi, > I have a current cluster running spark 1.4 and want to upgrade to latest > version. > How can I do it without creating a new

how to get Spark stage DAGs thru the REST APIs?

2015-11-03 Thread Xiaoyong Zhu
Hi experts, It seems that the below Spark stage DAGs are available in the Spark UI/Spark History server, but they are not available from any of the Spark REST APIs. Not sure if I missed anything; how can we get such kind of data from

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
Soren, If I understand how Mesos works correctly, even the fine grained mode keeps the JVMs around? On Tue, Nov 3, 2015 at 4:22 PM, Soren Macbeth wrote: > we use fine-grained mode. coarse-grained mode keeps JVMs around which > often leads to OOMs, which in turn kill the

Re: Upgrade spark cluster to latest version

2015-11-03 Thread Zhan Zhang
Spark is a client library. You can just download the latest release or build it on your own, and replace your existing one without changing your existing cluster. Thanks. Zhan Zhang On Nov 3, 2015, at 3:58 PM, roni > wrote: Hi Spark experts,

Rule Engine for Spark

2015-11-03 Thread Cassa L
Hi, Has anyone used a rule engine with Spark Streaming? I have a case where data is streaming from Kafka and I need to apply some rules on it (instead of hard-coding them in the code). Thanks, LCassa
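Short of a full rule engine, one lightweight approach is rules-as-data: load predicates from an external source and apply them inside the stream, so they can change without redeploying the job. A minimal sketch with hypothetical event and rule shapes (kafkaStream and parseEvent are assumed):

    case class Rule(name: String, applies: Map[String, String] => Boolean)

    // hypothetical rules; in practice, parsed from an external store and
    // re-broadcast between batches
    val rules = Seq(
      Rule("high-temp", e => e.get("temp").exists(_.toDouble > 90.0)),
      Rule("bad-status", e => e.get("status").exists(_ == "ERROR"))
    )

    kafkaStream.map(parseEvent)   // parseEvent: hypothetical Map[String, String] parser
      .flatMap(event => rules.filter(_.applies(event)).map(r => (r.name, event)))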

Re: Support Ordering on UserDefinedType

2015-11-03 Thread Simon Hafner
2015-11-03 23:20 GMT+01:00 Ionized : > TypeUtils.getInterpretedOrdering currently only supports AtomicType and > StructType. Is it possible to add support for UserDefinedType as well? Yes, make a PR to spark.

Re: Re: --jars option using hdfs jars cannot effect when spark standlone deploymode with cluster

2015-11-03 Thread our...@cnsuning.com
Akhil, All nodes must have the same jar locally, because the driver will be assigned to a random node; otherwise the driver log will report: no jar was found. Ricky Ou(欧 锐) From: Akhil Das Date: 2015-11-02 17:59 To: our...@cnsuning.com CC: user; 494165115

Re: Exception while reading from kafka stream

2015-11-03 Thread Ramkumar V
Thanks a lot, it worked for me. I'm using a single direct stream which retrieves data from all the topics. *Thanks*, On Mon, Nov 2, 2015 at 8:13 PM, Cody Koeninger wrote: > combine topicsSet_1 and topicsSet_2 in a single
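For reference, a minimal sketch of the combined-topic direct stream (the broker list is a placeholder; topicsSet_1 and topicsSet_2 are the Set[String] values from the original question):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val topics = topicsSet_1 ++ topicsSet_2   // one Set[String] covering both
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)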

PMML version in MLLib

2015-11-03 Thread Fazlan Nazeem
Hi, Can I know which version of PMML is used in MLlib's PMML export functionality for Spark 1.4.1 and Spark 1.5.1? I couldn't find this information in the documentation. If it is documented, please point me to the source. Thanks & Regards, Fazlan Nazeem *Software Engineer* *WSO2
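For context, a minimal sketch of the export call in question, using KMeansModel (one of the PMMLExportable models in 1.4+; the output path is hypothetical):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
    val model = KMeans.train(points, 2, 10)   // k = 2, 10 iterations
    model.toPMML(sc, "/tmp/kmeans-pmml")      // writes PMML via the SparkContext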

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Timothy Chen
Fine-grained mode does reuse the same JVM, but perhaps with different placement or different allocated cores compared to the same total memory allocation. Tim Sent from my iPhone > On Nov 3, 2015, at 6:00 PM, Reynold Xin wrote: > > Soren, > > If I understand how Mesos works

RE: error with saveAsTextFile in local directory

2015-11-03 Thread Jack Yang
Yes, mine is 1.4.0. So is this problem related to the version? I doubt that. Any comments, please? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, 4 November 2015 11:52 AM To: Jack Yang Cc: user@spark.apache.org Subject: Re: error with saveAsTextFile in local directory Looks

Spark 1.5.1 on Mesos NO Executor Java Options

2015-11-03 Thread Jo Voordeckers
Hi everyone, I'm trying to setup Spark 1.5.1 with mesos and the Cluster Dispatcher that I'm currently running on one of the slaves. We're migrating from a 1.3 standalone cluster and we're hoping to benefit from dynamic resource allocation with fine grained mesos for a better distribution of

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread MEETHU MATHEW
Hi, We are using Mesos fine-grained mode because we can have multiple instances of Spark sharing machines, and each application gets resources dynamically allocated. Thanks & Regards, Meethu M On Wednesday, 4 November 2015 5:24 AM, Reynold Xin wrote: If you

Re: collect() local faster than 4 node cluster

2015-11-03 Thread Sebastian Kuepers
I could actually figure out that it had to do with the Mesos run mode of Spark. Setting spark.mesos.coarse to true made all the difference. So the primary performance bummer was actually the fine-grained mode and its Mesos overhead. Thanks! Sebastian 2015-11-03 20:07 GMT+01:00 Sebastian
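For reference, a minimal sketch of the setting (it can equally be passed as --conf spark.mesos.coarse=true to spark-submit; the master URL is hypothetical):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")
      .set("spark.mesos.coarse", "true")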

What does "write time" means exactly in Spark UI?

2015-11-03 Thread Khaled Ammar
Hi, I wonder what "write time" means exactly. I run GraphX workloads and noticed that the main bottleneck in most stages is one or two tasks taking too long in "write time", delaying the whole job. Enabling speculation helps a little, but I am still interested to know how to fix that. I use

dataframe slow down with tungsten turn on

2015-11-03 Thread gen tang
Hi sparkers, I am using DataFrames to do some large ETL jobs. More precisely, I create a DataFrame from a Hive table and do some operations. And then I save it as JSON. When I used spark-1.4.1, the whole process was quite fast, about 1 min. However, when I use the same code with spark-1.5.1 (with

Checkpoint not working after driver restart

2015-11-03 Thread vimal dinakaran
I have a simple Spark Streaming application which reads data from RabbitMQ and does some aggregation on window intervals of 1 min and 1 hour, for a batch interval of 30s. I have a three-node setup. And to enable checkpointing, I have mounted the same directory using sshfs to all worker nodes
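A minimal sketch of the recovery pattern checkpointing requires: the context must be built inside the factory passed to getOrCreate, not unconditionally on every start. sparkConf and the stream setup are assumed; note too that the docs recommend a fault-tolerant filesystem such as HDFS for the checkpoint directory, which an sshfs mount may not satisfy:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/shared/checkpoint"   // the mounted path in this setup

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(30))
      ssc.checkpoint(checkpointDir)
      // ... set up the RabbitMQ stream and windowed aggregations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()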

Re: ClassNotFoundException even if class is present in Jarfile

2015-11-03 Thread Iulian Dragoș
Where is the exception thrown (full stack trace)? How are you running your application, via spark-submit or spark-shell? On Tue, Nov 3, 2015 at 1:43 AM, hveiga wrote: > Hello, > > I am facing an issue where I cannot run my Spark job in a cluster > environment (standalone or

Re: How to enable debug in Spark Streaming?

2015-11-03 Thread Ted Yu
Take a look at: http://search-hadoop.com/m/q3RTtxRM5d2SLnmQ1=Re+Override+Logging+with+spark+streaming On Tue, Nov 3, 2015 at 5:29 AM, diplomatic Guru wrote: > I have an issue with a Spark Streaming job that appears to be running but > not producing any results.

Re: spark read data from aws s3

2015-11-03 Thread hveiga
You also need to have the hadoop-aws library in your classpath. From Hadoop 2.6 onward, the AWS libraries come in that separate artifact. Also, you will need this line in your Hadoop configuration: hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
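Putting the pieces together, a minimal sketch using the s3n scheme (the keys and bucket are placeholders):

    val hc = sc.hadoopConfiguration
    hc.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hc.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")        // placeholder
    hc.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")    // placeholder
    val data = sc.textFile("s3n://my-bucket/some/path/")      // hypothetical bucket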

Re: kinesis batches hang after YARN automatic driver restart

2015-11-03 Thread Tathagata Das
The Kinesis integration underneath uses the KCL libraries which takes a minute or so sometimes to spin up the threads and start getting data from Kinesis. That is under normal conditions. In your case, it could be happening that because of your killing and restarting, the restarted KCL may be