spark-csv package - output to filename.csv?

2015-09-03 Thread Ewan Leith
Using the spark-csv package or outputting to text files, you end up with files named: test.csv/part-00 rather than a more user-friendly "test.csv", even if there's only 1 part file. We can merge the files using the Hadoop merge command with something like this code from
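A minimal sketch of one way to do that merge with Hadoop's FileUtil.copyMerge (the paths and the df DataFrame are illustrative, not from the original post):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // df is an existing DataFrame; write the part files first, then merge them.
    df.write.format("com.databricks.spark.csv").save("/tmp/test-parts")

    val hadoopConf = new Configuration()
    val fs = FileSystem.get(hadoopConf)
    FileUtil.copyMerge(
      fs, new Path("/tmp/test-parts"),  // directory of part-* files
      fs, new Path("/tmp/test.csv"),    // single merged output file
      false,                            // keep the source directory
      hadoopConf, null)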

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-09-03 Thread Sean Owen
Since it sounds like this has been encountered 3 times, and I've personally seen it and mostly verified it, I think it's legit enough for a JIRA: SPARK-10433. I am sorry to say I don't know what is going on here though. On Thu, Sep 3, 2015 at 1:56 PM, Peter Rudenko wrote:

Re: Hbase Lookup

2015-09-03 Thread Tao Lu
Yes, Ayan, your approach will work. Or alternatively, use Spark and write a Scala/Java function which implements logic similar to your Pig UDF. Both approaches look similar. Personally, I would go with the Spark solution; it will be slightly faster, and easier if you already have a Spark cluster

Re: Hbase Lookup

2015-09-03 Thread Tao Lu
But I don't see how it works here with Phoenix or an HBase coprocessor. Remember we are joining 2 big data sets here: one is the big file in HDFS, the other is the records in HBase. The driving force comes from the Hadoop cluster. On Thu, Sep 3, 2015 at 11:37 AM, Jörn Franke wrote: > If

Re: Small File to HDFS

2015-09-03 Thread nibiau
My main question in case of HAR usage is: is it possible to use Pig on it, and what about performance? - Original Message - From: "Jörn Franke" To: nib...@free.fr, user@spark.apache.org Sent: Thursday, 3 September 2015 15:54:42 Subject: Re: Small File to HDFS Store

pySpark window functions are not working in the same way as Spark/Scala ones

2015-09-03 Thread Sergey Shcherbakov
Hello all, I'm experimenting with Spark 1.4.1 window functions and have come to a problem in pySpark that I've described in a Stackoverflow question. In essence, the wSpec = Window.orderBy(df.a)

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
Store them as a Hadoop archive (HAR). On Wed, 2 Sep 2015 at 18:07, wrote: > Hello, > I'm currently using Spark Streaming to collect small messages (events), > size being <50 KB, volume is high (several millions per day) and I have to > store those messages in HDFS. > I

Re: Small File to HDFS

2015-09-03 Thread Tao Lu
Your requirements conflict with each other. 1. You want to dump all your messages somewhere 2. You want to be able to update/delete individual messages 3. You don't want to introduce another NoSQL database (like HBase) since you already have all messages stored in MongoDB. My suggestion is: 1. Don't

Re: Slow Mongo Read from Spark

2015-09-03 Thread Jörn Franke
You might think about another storage layer that is not MongoDB (HDFS+ORC+compression or HDFS+Parquet+compression) to improve performance. On Thu, 3 Sep 2015 at 9:15, Akhil Das wrote: > On SSD you will get around 30-40MB/s on a single machine (on 4 cores). > >

Re: Tuning - tasks per core

2015-09-03 Thread Igor Berman
Suppose you have 1 job that does some transformation, suppose you have X cores in your cluster and you are willing to give all of them to your job, and suppose you have no shuffles (to keep it simple): set the number of partitions of your input data to be 3X or 2X, thus you'll get 2-3 tasks per core. On
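A small sketch of that rule of thumb (the input path is made up):

    // X = total cores available to the job
    val numCores = sc.defaultParallelism
    // Ask for 2-3 partitions per core when reading the input...
    val input = sc.textFile("hdfs:///data/input", 3 * numCores)
    // ...or reshape an existing RDD to the same target.
    val reshaped = input.repartition(2 * numCores)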

Re: Small File to HDFS

2015-09-03 Thread Martin Menzel
Hello Nicolas, I solved a similar problem using FSDataOutputStream http://blog.woopi.org/wordpress/files/hadoop-2.6.0-javadoc/org/apache/hadoop/fs/FSDataOutputStream.html Each entry can be an ArrayWritable

RE: How to Take the whole file as a partition

2015-09-03 Thread Ewan Leith
Have a look at sparkContext.binaryFiles; it works like wholeTextFiles but returns a PortableDataStream per file. It might be a workable solution, though you'll need to handle the binary-to-UTF-8 (or equivalent) conversion. Thanks, Ewan From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: 03
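A rough sketch of that conversion (assuming the files are UTF-8 text):

    import java.nio.charset.StandardCharsets

    // One (path, PortableDataStream) pair per file; decode each file as a single string.
    val wholeFiles = sc.binaryFiles("hdfs:///input")
      .map { case (path, stream) => (path, new String(stream.toArray, StandardCharsets.UTF_8)) }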

NOT IN in Spark SQL

2015-09-03 Thread Pietro Gentile
Hi all, how can I use the "NOT IN" clause in Spark SQL 1.2? It keeps giving me syntax errors, but the query is correct SQL. Thanks in advance, Best regards, Pietro.
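For what it's worth, a common workaround on versions that don't parse NOT IN is a left outer join filtered on NULL; a sketch with made-up table and column names:

    // Intended query: SELECT a.id FROM a WHERE a.id NOT IN (SELECT id FROM b)
    val result = sqlContext.sql(
      """SELECT a.id
        |FROM a LEFT OUTER JOIN b ON a.id = b.id
        |WHERE b.id IS NULL""".stripMargin)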

Re: Hbase Lookup

2015-09-03 Thread Jörn Franke
If you use Pig or Spark you increase the complexity from an operations management perspective significantly. Spark should be seen from a platform perspective, if it makes sense. If you can do it directly with HBase/Phoenix or only an HBase coprocessor then this should be preferred. Otherwise you pay

How to Take the whole file as a partition

2015-09-03 Thread Shuai Zheng
Hi All, I have 1000 files, from 500M to 1-2GB at this moment, and I want Spark to read them as partitions at the file level, which means I want FileSplit turned off. I know there are some solutions, but they are not very good in my case: 1, I can't use the WholeTextFiles method, because my file is

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
I'm seeing an oddity where I initially set the batchdurationmillis to 1 second and it works fine: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(batchDurationMillis)); Then I tried changing the value to 10 seconds. The change didn't seem to take. I've

Re: How to Take the whole file as a partition

2015-09-03 Thread Tao Lu
Your situation is special. It seems to me Spark may not fit well in your case. You want to process the individual files (500M~2G) as a whole, and you want good performance. You may want to write your own Scala/Java programs, distribute them along with those files across your cluster, and run them in

Re: Slow Mongo Read from Spark

2015-09-03 Thread Deepesh Maheshwari
Because of the existing architecture, I am bound to use MongoDB. Please suggest something for this. On Thu, Sep 3, 2015 at 9:10 PM, Jörn Franke wrote: > You might think about another storage layer not being mongodb > (hdfs+orc+compression or hdfs+parquet+compression) to improve

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-03 Thread Ricardo Luis Silva Paiva
Good tip. I will try that. Thank you. On Wed, Sep 2, 2015 at 6:54 PM, Cody Koeninger wrote: > Yeah, in general if you're changing the jar you can't recover the > checkpoint. > > If you're just changing parameters, why not externalize those in a > configuration file so your

spark.shuffle.spill=false ignored?

2015-09-03 Thread Eric Walker
Hi, I am using Spark 1.3.1 on EMR with lots of memory. I have attempted to run a large pyspark job several times, specifying `spark.shuffle.spill=false` in different ways. It seems that the setting is ignored, at least partially, and some of the tasks start spilling large amounts of data to

Re: large number of import-related function calls in PySpark profile

2015-09-03 Thread Priedhorsky, Reid
On Sep 2, 2015, at 11:31 PM, Davies Liu wrote: Could you have a short script to reproduce this? Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04. import pandas as pd # must be in default path for interpreter import pyspark

Re: Too many open files issue

2015-09-03 Thread Sigurd Knippenberg
I don't think that is the issue. I have it set up to run in a thread pool but I have set the pool size to 1 for this test until I get this resolved. I am having some problems with using the Spark web portal since it is picking a random port and, with the way my environment is set up, by the time I have

Re: Spark partitions from CassandraRDD

2015-09-03 Thread Ankur Srivastava
Hi Alaa, Partitioning when using CassandraRDD depends on your partition key in the Cassandra table. If you see only 1 partition in the RDD it means all the rows you have selected have the same partition_key in C*. Thanks, Ankur On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF)

Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
If you run on YARN, you can use Kerberos, be authenticated as the right user, etc. in the same way as MapReduce jobs. Matei > On Sep 3, 2015, at 1:37 PM, Daniel Schulz wrote: > > Hi, > > I really enjoy using Spark. An obstacle to sell it to our clients

Re: Problem while loading saved data

2015-09-03 Thread Ewan Leith
From that, I'd guess that HDFS isn't set up between the nodes, or for some reason writes are defaulting to file:///path/ rather than hdfs:///path/ -- Original message -- From: Amila De Silva Date: Thu, 3 Sep 2015 17:12 To: Ewan Leith; Cc: user@spark.apache.org; Subject: Re:

SparkSQL without access to arrays?

2015-09-03 Thread Terry
Hi, I'm using Spark 1.4.1. Here is the printSchema after loading my json file: root |-- result: struct (nullable = true) | |-- negative_votes: long (nullable = true) | |-- players: array (nullable = true) | | | |-- account_id: long (nullable = true) | | | |-- assists:
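A hedged sketch of getting at those array elements (column names follow the schema above; the file name is made up):

    import org.apache.spark.sql.functions.explode

    val df = sqlContext.read.json("matches.json")
    // One output row per element of result.players; nested fields become plain columns.
    val players = df.select(explode(df("result.players")).as("player"))
    players.select("player.account_id", "player.assists").show()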

Re: large number of import-related function calls in PySpark profile

2015-09-03 Thread Davies Liu
I think this is not a problem of PySpark; you would also see this if you profile this script: ``` list(map(map_, range(sc.defaultParallelism))) ``` 81777/80874 0.086 0.000 0.360 0.000 :2264(_handle_fromlist) On Thu, Sep 3, 2015 at 11:16 AM, Priedhorsky, Reid wrote: > >

Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Hi, I'm testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, Cassandra Spark connector 1.4, running in standalone mode. I am getting 4000 rows from Cassandra (4MB row), where the row keys are random. .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache I am expecting that it

Re: Small File to HDFS

2015-09-03 Thread nibiau
A HAR archive seems a good idea, but just one last question to be sure to make the best choice: - Is it possible to override (remove/replace) a file inside the HAR? Basically the names of my small files will be the keys of my records, and sometimes I will need to replace the content of a file by a

Re: Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Thanks Ankur, But I grabbed some keys from the Spark results and ran "nodetool -h getendpoints " and it showed the data is coming from at least 2 nodes? Regards, Alaa On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava < ankur.srivast...@gmail.com> wrote: > Hi Alaa, > > Partition when

Re: spark 1.4.1 - LZFException

2015-09-03 Thread Yadid Ayzenberg
Hi Akhil, No, it seems I have plenty more disk space available on that node. I looked at the logs and one minute before that exception I am seeing the following exception. 15/09/03 12:51:39 ERROR TransportChannelHandler: Connection to /x.y.z.w:44892 has been quiet for 12 ms while there

RE: How to Take the whole file as a partition

2015-09-03 Thread Shuai Zheng
Hi, Is there any way to change the default split size when loading data in Spark? By default it is 64M. I know how to change this in Hadoop MapReduce, but I'm not sure how to do this in Spark. Regards, Shuai From: Tao Lu [mailto:taolu2...@gmail.com] Sent: Thursday, September 03,
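One commonly suggested route (a sketch; the property names come from Hadoop 2 and are not specific to Spark) is to set the split size on the SparkContext's Hadoop configuration before reading:

    // Ask for ~256 MB splits instead of the default block size.
    val targetSplit = (256L * 1024 * 1024).toString
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", targetSplit)
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", targetSplit)
    val rdd = sc.textFile("hdfs:///big-input")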

Re: Ranger-like Security on Spark

2015-09-03 Thread Daniel Schulz
Hi Matei, Thanks for your answer. My question is regarding simple authenticated Spark-on-YARN only, without Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my HDFS user and only be able to access files I am entitled to read/write? Will it enforce HDFS ACLs and Ranger

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Kohki Nishio
Let's say I have data like this: ID | Some1 | Some2 | Some3 | A1 | kdsfajfsa | dsafsdafa | fdsfafa | A2 | dfsfafasd | 23jfdsjkj | 980dfs | A3 | 99989df | jksdljas | 48dsaas | .. Z00.. | fdsafdsfa | fdsdafdas | 89sdaff | My understanding is that if I

Re: spark-submit not using conf/spark-defaults.conf

2015-09-03 Thread Davies Liu
I think it's a missing feature. On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl wrote: > So a bit more investigation, shows that: > > if I have configured spark-defaults.conf with: > > "spark.files library.py" > > then if I call > > "spark-submit.py -v test.py" > > I

Re: Resource allocation issue - is it possible to submit a new job in existing application under a different user?

2015-09-03 Thread Steve Loughran
If it's running the thrift server from Hive, it's got a SQL API for you to connect to... On 3 Sep 2015, at 17:03, Dhaval Patel wrote: I am accessing a shared cluster mode Spark environment. However, there is an existing application

Resource allocation issue - is it possible to submit a new job in existing application under a different user?

2015-09-03 Thread Dhaval Patel
I am accessing a shared cluster mode Spark environment. However, there is an existing application (SparkSQL/Thrift Server), running under a different user, that occupies all available cores. Please see attached screenshot to get an idea about current resource utilization. Is there a way I can use

Re: Exceptions in threads in executor code don't get caught properly

2015-09-03 Thread Wayne Song
Sorry, I guess my code and the exception didn't make it to the mailing list. Here's my code: def main(args: Array[String]) { val conf = new SparkConf().setAppName("Test app") val sc = new SparkContext(conf) val rdd = sc.parallelize(Array(1, 2, 3)) val rdd1 = rdd.map({x =>

Ranger-like Security on Spark

2015-09-03 Thread Daniel Schulz
Hi, I really enjoy using Spark. An obstacle to selling it to our clients currently is the missing Kerberos-like security on a Hadoop with simple authentication. Are there plans, a proposal, or a project to deliver a Ranger plugin or something similar for Spark? The target is to differentiate users

VaryMax Rotation and other questions for PCA in Spark MLLIB

2015-09-03 Thread Behzad Altaf
Hi All, Hope you are doing good. We are using Spark MLLIB (1.4.1) PCA functionality for dimensionality reduction. So far we are able to condense n features into k features using https://spark.apache.org/docs/1.4.1/mllib-dimensionality-reduction.html#principal-component-analysis-pca The

RE: Unbale to run Group BY on Large File

2015-09-03 Thread SAHA, DEBOBROTA
Hi Silvio, I am trying to group the data from an Oracle RAW table by loading the raw table into an RDD first and then registering that as a table in Spark. Thanks, Debobrota From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] Sent: Wednesday, September 02, 2015 5:03 PM To: SAHA,
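A rough sketch of that flow via the Spark 1.4-style JDBC reader (the connection string, credentials and column names are all placeholders):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "db_user")
    props.setProperty("password", "db_password")

    // Load the Oracle table as a DataFrame, register it, then group in SQL.
    val raw = sqlContext.read.jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "RAW_TABLE", props)
    raw.registerTempTable("raw_table")
    val grouped = sqlContext.sql("SELECT key_col, count(*) AS cnt FROM raw_table GROUP BY key_col")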

Re: pySpark window functions are not working in the same way as Spark/Scala ones

2015-09-03 Thread Davies Liu
This is a known bug in 1.4.1, fixed in 1.4.2 and 1.5 (neither is released yet). On Thu, Sep 3, 2015 at 7:41 AM, Sergey Shcherbakov wrote: > Hello all, > > I'm experimenting with Spark 1.4.1 window functions > and have come to a problem in pySpark that I've

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
HAR is transparent and has hardly any performance overhead. You may decide not to compress, or to use a fast compression algorithm such as Snappy (recommended). On Thu, 3 Sep 2015 at 16:17, wrote: > My main question in case of HAR usage is , is it possible to use Pig on it > and

Re: large number of import-related function calls in PySpark profile

2015-09-03 Thread Davies Liu
The slowness in PySpark may be related to the search path added by PySpark; could you show the sys.path? On Thu, Sep 3, 2015 at 1:38 PM, Priedhorsky, Reid wrote: > > On Sep 3, 2015, at 12:39 PM, Davies Liu wrote: > > I think this is not a problem of

Re: Resource allocation issue - is it possible to submit a new job in existing application under a different user?

2015-09-03 Thread Dhaval Patel
Yes, you're right, and I can connect to it through Tableau and similar tools, but I don't know how I can connect from a shell where I can submit more jobs to this application. Any insight on how I can connect using a shell? On Thu, Sep 3, 2015 at 1:39 PM, Steve Loughran wrote: > If its

Re: Hbase Lookup

2015-09-03 Thread ayan guha
Hi, thanks for your comments. My driving point is that instead of loading HBase data entirely I want to process record-by-record lookups, and that is best done in a UDF or map function. I also would love to do it in Spark, but there is no production cluster here yet :( @Franke: I do not have enough competency on

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
Well, it is the same as in normal HDFS: deleting the file and putting a new one with the same name works. On Thu, 3 Sep 2015 at 21:18, wrote: > HAR archive seems a good idea , but just a last question to be sure to do > the best choice : > - Is it possible to override (remove/replace)

Re: large number of import-related function calls in PySpark profile

2015-09-03 Thread Priedhorsky, Reid
On Sep 3, 2015, at 12:39 PM, Davies Liu wrote: I think this is not a problem of PySpark, you would also see this if you profile this script: ``` list(map(map_, range(sc.defaultParallelism))) ``` 81777/80874 0.086 0.000 0.360 0.000

spark-shell does not see conf folder content on emr-4

2015-09-03 Thread Alexander Pivovarov
Hi Everyone, My question is specific to running spark-1.4.1 on emr-4.0.0: spark is installed to /usr/lib/spark, the conf folder is linked to /etc/spark/conf, and spark-shell is at /usr/bin/spark-shell. I noticed that if I run spark-shell it does not read the /etc/spark/conf folder files (e.g. spark-env.sh and

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Adrien Mogenet
Any code / Parquet schema to provide? I'm not sure I understand which step fails there... On 3 September 2015 at 04:12, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Did you specify the partitioning column while saving data.. > On Sep 3, 2015 5:41 AM, "Kohki Nishio"

RE: FlatMap Explanation

2015-09-03 Thread Zalzberg, Idan (Agoda)
Hi, Yes, I can explain: 1 to 3 -> 1,2,3; 2 to 3 -> 2,3; 3 to 3 -> 3; 3 to 3 -> 3. flatMap then concatenates the results, so you get 1,2,3, 2,3, 3, 3. You should get the same with any Scala collection. Cheers From: Ashish Soni [mailto:asoni.le...@gmail.com] Sent: Thursday, September 03, 2015 9:06 AM
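For reference, a tiny sketch that reproduces those numbers, assuming the original input was the collection (1, 2, 3, 3):

    val result = sc.parallelize(Seq(1, 2, 3, 3))
      .flatMap(x => x to 3)   // each element maps to the range x..3, then everything is concatenated
      .collect()
    // result: Array(1, 2, 3, 2, 3, 3, 3)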

Re: Slow Mongo Read from Spark

2015-09-03 Thread Akhil Das
On SSD you will get around 30-40MB/s on a single machine (on 4 cores). Thanks Best Regards On Mon, Aug 31, 2015 at 3:13 PM, Deepesh Maheshwari < deepesh.maheshwar...@gmail.com> wrote: > tried it, gives the same exception as above > > Exception in thread "main" java.io.IOException: No FileSystem

Re: Managing httpcomponent dependency in Spark/Solr

2015-09-03 Thread Igor Berman
Not sure if it will help, but have you checked https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html On 31 August 2015 at 19:33, Oliver Schrenk wrote: > Hi, > > We are running a distributed indexing service for Solr (4.7) on a Spark > (1.2)

Re: Connection closed error while running Terasort

2015-09-03 Thread Akhil Das
Can you look a bit deeper into the executor logs? It may be that it hit the GC overhead limit etc., which led to the connection failures. Thanks, Best Regards On Tue, Sep 1, 2015 at 5:43 AM, Suman Somasundar < suman.somasun...@oracle.com> wrote: > Hi, > > > > I am getting the following error while

Re: Error using spark.driver.userClassPathFirst=true

2015-09-03 Thread Akhil Das
It's messing up your classpath; there was a discussion about this here previously: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-on-yarn-java-lang-UnsatisfiedLinkError-NativeCodeLoader/td-p/22724 Thanks, Best Regards On Tue, Sep 1, 2015 at 4:58 PM, cgalan

Re: Managing httpcomponent dependency in Spark/Solr

2015-09-03 Thread Akhil Das
You can also set executor.userClassPathFirst. There are a couple of classpath configurations available to override the defaults; you can find them here: http://spark.apache.org/docs/latest/configuration.html#runtime-environment Thanks, Best Regards On Mon, Aug 31, 2015 at 10:03 PM, Oliver

Re: Exceptions in threads in executor code don't get caught properly

2015-09-03 Thread Akhil Das
I'm not able to find the piece of code that you wrote, but you can use a try...catch to catch your user-specific exceptions and log them. Something like this: myRdd.map(x => try{ //something }catch{ case e:Exception => log.error("Whoops!! :" + e) }) Thanks

Re: Small File to HDFS

2015-09-03 Thread nibiau
OK, but then some questions: - Sometimes I have to remove some messages from HDFS (cancel/replace cases), is it possible? - In the case of a big zip file, is it possible to easily process Pig on it directly? Thanks, Nicolas - Original Message - From: "Tao Lu" To:

Fwd: Code generation for GPU

2015-09-03 Thread kiran lonikar
Hi, I am speaking at the Spark Europe summit on exploiting GPUs for columnar DataFrame operations. I was going through various blogs, talks and JIRAs from all the key Spark folks and trying to figure out

Re: FlatMap Explanation

2015-09-03 Thread Ashish Soni
Thanks a lot everyone. Very Helpful. Ashish On Thu, Sep 3, 2015 at 2:19 AM, Zalzberg, Idan (Agoda) < idan.zalzb...@agoda.com> wrote: > Hi, > > Yes, I can explain > > > > 1 to 3 -> 1,2,3 > > 2 to 3- > 2,3 > > 3 to 3 -> 3 > > 3 to 3 -> 3 > > > > Flat map that concatenates the results, so you get

INDEXEDRDD in PYSPARK

2015-09-03 Thread shahid ashraf
Hi Folks, any resources to get started using https://github.com/amplab/spark-indexedrdd in pyspark? -- with Regards Shahid Ashraf

Re: Small File to HDFS

2015-09-03 Thread Ndjido Ardo Bar
Hi Nibiau, HBase seems to be a good solution to your problems. As you may know, storing your messages as key-value pairs in HBase saves you the overhead of manually resizing blocks of data using zip files. The added advantage, along with the fact that HBase uses HDFS for storage, is the

Re: Small File to HDFS

2015-09-03 Thread Ted Yu
Agree with Ardo. The API provided by HBase is versatile. There is checkAndPut as well. Cheers > On Sep 3, 2015, at 5:00 AM, Ndjido Ardo Bar wrote: > > Hi Nibiau, > > Hbase seems to be a good solution to your problems. As you may know storing > yours messages as a key-value

Re: Running Examples

2015-09-03 Thread delbert
Hi folks, I am running into the same issue. Running the script (with sc. instead of spark. and filling in NUM_SAMPLES) works fine for me. Running on Windows 10, admin PowerShell console, started with the command: ./bin/spark-shell --master local[4] delbert -- View this message in context:

Re: Hbase Lookup

2015-09-03 Thread Ted Yu
Ayan: Please read this: http://hbase.apache.org/book.html#cp Cheers On Thu, Sep 3, 2015 at 2:13 PM, ayan guha wrote: > Hi > > Thanks for your comments. My driving point is instead of loading Hbase > data entirely I want to process record by record lookup and that is best >

different Row objects?

2015-09-03 Thread Wei Chen
Hey Friends, Recently I have been using Spark 1.3.1, mainly pyspark.sql. I noticed that the Row object collected directly from a DataFrame is different from the Row object we directly defined from Row(*arg, **kwarg). >>>from pyspark.sql.types import Row >>>aaa = Row(a=1, b=2, c=Row(a=1, b=2))

Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
Even simple Spark-on-YARN should run as the user that submitted the job, yes, so HDFS ACLs should be enforced. Not sure how it plays with the rest of Ranger. Matei > On Sep 3, 2015, at 4:57 PM, Jörn Franke wrote: > > Well if it needs to read from hdfs then it will adhere

Re: Ranger-like Security on Spark

2015-09-03 Thread Marcelo Vanzin
On Thu, Sep 3, 2015 at 5:15 PM, Matei Zaharia wrote: > Even simple Spark-on-YARN should run as the user that submitted the job, > yes, so HDFS ACLs should be enforced. Not sure how it plays with the rest of > Ranger. It's slightly more complicated than that (without

Re: Ranger-like Security on Spark

2015-09-03 Thread Ruslan Dautkhanov
You could define access in Sentry and enable permission sync with HDFS, so you could just grant access on Hive on a per-database or per-table basis. It should work for Spark too, as Sentry will propagate "grants" to HDFS ACLs.

Does Spark.ml LogisticRegression assumes only Double valued features?

2015-09-03 Thread njoshi
Hi, I was looking at the `Spark 1.5` dataframe/row api and the implementation for the logistic regression

Re: different Row objects?

2015-09-03 Thread Davies Liu
This was fixed in 1.5; could you download 1.5-RC3 to test this? On Thu, Sep 3, 2015 at 4:45 PM, Wei Chen wrote: > Hey Friends, > > Recently I have been using Spark 1.3.1, mainly pyspark.sql. I noticed that > the Row object collected directly from a DataFrame is

repartition on direct kafka stream

2015-09-03 Thread Shushant Arora
1. Does repartitioning a direct Kafka stream shuffle only the offsets or the actual Kafka messages across executors? Say I have a direct kafkastream directKafkaStream.repartition(numexecutors).mapPartitions(new FlatMapFunction>, String>(){ ... } Say originally I have

Re: repartition on direct kafka stream

2015-09-03 Thread Saisai Shao
Yes, not the offset ranges but the real data will be shuffled when you use repartition(). Thanks, Saisai On Fri, Sep 4, 2015 at 12:42 PM, Shushant Arora wrote: > 1.Does repartitioning on direct kafka stream shuffles only the offsets or > exact kafka messages across

Re: spark-submit not using conf/spark-defaults.conf

2015-09-03 Thread Axel Dahl
logged it here: https://issues.apache.org/jira/browse/SPARK-10436 On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu wrote: > I think it's a missing feature. > > On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl wrote: > > So a bit more investigation, shows

Re: Ranger-like Security on Spark

2015-09-03 Thread Jörn Franke
Well, if it needs to read from HDFS then it will adhere to the permissions defined there and/or in Ranger. However, I am not aware that you can protect DataFrames, tables or streams in general in Spark. On Thu, 3 Sep 2015 at 21:47, Daniel Schulz wrote: > Hi

Parsing Avro from Kafka Message

2015-09-03 Thread Daniel Haviv
Hi, I'm reading messages from Kafka where the value is an Avro file. I would like to parse the contents of the message and work with it as a DataFrame, like with the spark-avro package, but instead of files, pass it an RDD. How can this be achieved? Thank you. Daniel
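One way to sketch it with plain Avro classes (the messageBytes RDD of raw Kafka values is an assumption, and each value is treated as a complete Avro container file):

    import java.io.ByteArrayInputStream
    import org.apache.avro.file.DataFileStream
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import scala.collection.JavaConverters._

    // messageBytes: RDD[Array[Byte]] holding the Kafka message values.
    val records = messageBytes.flatMap { bytes =>
      val reader = new DataFileStream[GenericRecord](
        new ByteArrayInputStream(bytes), new GenericDatumReader[GenericRecord]())
      // Materialize before closing, since the stream is read lazily.
      try reader.iterator().asScala.map(_.toString).toList
      finally reader.close()
    }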

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-09-03 Thread Peter Rudenko
Confirmed, having the same issue (1.4.1 mllib package). For a smaller dataset accuracy degraded also. Haven't tested yet in 1.5 with the ml package implementation. val boostingStrategy = BoostingStrategy.defaultParams("Classification") boostingStrategy.setNumIterations(30)

RE: Problem while loading saved data

2015-09-03 Thread Ewan Leith
Your error log shows you attempting to read from 'people.parquet2', not 'people.parquet' as you've put below; is that just from a different attempt? Otherwise, it's an odd one! There aren't _SUCCESS, _common_metadata and _metadata files under people.parquet that you've listed below, which would

RE: spark 1.4.1 saveAsTextFile (and Parquet) is slow on emr-4.0.0

2015-09-03 Thread Ewan Leith
For those who have similar issues on EMR writing Parquet files, if you update mapred-site.xml with the following properties: mapred.output.direct.EmrFileSystem=true, mapred.output.direct.NativeS3FileSystem=true, parquet.enable.summary-metadata=false

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Tathagata Das
Are you accidentally recovering from checkpoint files which have 10 seconds as the batch interval? On Thu, Sep 3, 2015 at 7:34 AM, Dmitry Goldenberg wrote: > I'm seeing an oddity where I initially set the batchdurationmillis to 1 > second and it works fine: > >
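A hedged sketch of that situation (directory and duration values are illustrative): on recovery the factory function is skipped and the batch interval stored in the checkpoint wins, so later changes to batchDurationMillis appear "sticky".

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"
    val batchDurationMillis = 10000L

    def createContext(): StreamingContext = {
      // Only called when no checkpoint exists yet; otherwise the checkpointed interval is reused.
      val ssc = new StreamingContext(new SparkConf().setAppName("Test app"), Milliseconds(batchDurationMillis))
      ssc.checkpoint(checkpointDir)
      ssc
    }

    val context = StreamingContext.getOrCreate(checkpointDir, createContext _)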

DataFrame creation delay?

2015-09-03 Thread Isabelle Phan
Hello, I am using SparkSQL to query some Hive tables. Most of the time, when I create a DataFrame using sqlContext.sql("select * from table") command, DataFrame creation is less than 0.5 second. But I have this one table with which it takes almost 12 seconds! scala> val start =

Re: How to determine the value for spark.sql.shuffle.partitions?

2015-09-03 Thread Isabelle Phan
+1. I had the exact same question as I am working on my first Spark applications. Hope someone can share some best practices. Thanks! Isabelle On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman wrote: > Hi all, > > The number of partitions greatly affects the speed and efficiency
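For reference, the knob itself is just a SQL configuration; choosing the value is the hard part (a sketch, where the multiplier is a rough starting point rather than a rule):

    // Default is 200; a common starting point is a small multiple of the total cores.
    sqlContext.setConf("spark.sql.shuffle.partitions", (3 * sc.defaultParallelism).toString)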