optimal way to load parquet files with partition

2016-02-02 Thread Wei Chen
Hi All, I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way to get two months of data from a given year (let's say June and July)? Two ways I can think of: 1. use unionAll df1 = sqc.read.parquet('xxx/year=2015/month=6') df2 = sqc.read.parquet('xxx/year=2015/month=7') df =
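
A minimal Scala sketch of the two options being weighed, assuming a SQLContext named sqlContext and the year=yyyy/month=mm layout above; paths and names are illustrative:

    // Option 1: read each month separately and union the results.
    val june = sqlContext.read.parquet("xxx/year=2015/month=6")
    val july = sqlContext.read.parquet("xxx/year=2015/month=7")
    val twoMonths = june.unionAll(july)

    // Option 2: read from the base path and rely on partition pruning.
    val pruned = sqlContext.read.parquet("xxx").filter("year = 2015 AND month IN (6, 7)")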

Re: optimal way to load parquet files with partition

2016-02-02 Thread Michael Armbrust
It depends how many partitions you have and if you are only doing a single operation. Loading all the data and filtering will require us to scan the directories to discover all the months. This information will be cached. Then we should prune and avoid reading unneeded data. Option 1 does not

Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-02 Thread diplomatic Guru
Hi Jorge, Unfortunately, I couldn't transform the data as you suggested. This is what I get:
+---+---------+-------------+
| id|pageIndex|      pageVec|
+---+---------+-------------+
|0.0|      3.0|    (3,[],[])|
|1.0|      0.0|(3,[0],[1.0])|
|2.0|      2.0|(3,[2],[1.0])|
|3.0|

Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
Hi Ben, > My company uses Lambda to do simple data moving and processing using python > scripts. I can see using Spark instead for the data processing would make it > into a real production level platform. That may be true. Spark has first class support for Python which should make your life

RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Xiangrui, For the following problem, I found an issue ticket you posted before: https://issues.apache.org/jira/browse/HADOOP-10614 I wonder if this has been fixed in Spark 1.5.2, which I believe it has. Any suggestion on how to fix it? Thanks Hao From: Lin, Hao [mailto:hao@finra.org]

Spark 1.5.2 memory error

2016-02-02 Thread Stefan Panayotov
Hi Guys, I need help with Spark memory errors when executing ML pipelines. The error that I see is: 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298) 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jakob Odersky
Can you share some code that produces the error? It is probably not due to Spark but rather to the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source of OOM errors. On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov wrote:

Error trying to get DF for Hive table stored HBase

2016-02-02 Thread Doug Balog
I’m trying to create a DF for an external Hive table that is in HBase. I get a NoSuchMethodError

RE: Spark 1.5.2 memory error

2016-02-02 Thread Stefan Panayotov
For the memoryOverhead I have the default of 10% of 16g, and Spark version is 1.5.2. Stefan Panayotov, PhD Sent from Outlook Mail for Windows 10 phone From: Ted Yu Sent: Tuesday, February 2, 2016 4:52 PM To: Jakob Odersky Cc: Stefan Panayotov; user@spark.apache.org Subject: Re: Spark 1.5.2

Re: Spark Pattern and Anti-Pattern

2016-02-02 Thread Lars Albertsson
Querying a service or a database from a Spark job is in most cases an anti-pattern, but there are exceptions. Jobs that rely on a live database become unstable and nondeterministic. The recommended pattern is to take regular dumps of the database to your cluster storage, e.g. HDFS, and join
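
A hedged sketch of that pattern; the paths and column names are hypothetical, and the snapshot is assumed to be refreshed by a separate dump job:

    // Join events against a periodic HDFS dump of the table instead of querying the live database.
    val events = sqlContext.read.parquet("hdfs:///data/events/2016-02-02")
    val users  = sqlContext.read.parquet("hdfs:///dumps/users/2016-02-02")
    val enriched = events.join(users, events("userId") === users("id"))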

Re: Spark 1.5.2 memory error

2016-02-02 Thread Ted Yu
What value do you use for spark.yarn.executor.memoryOverhead ? Please see https://spark.apache.org/docs/latest/running-on-yarn.html for description of the parameter. Which Spark release are you using ? Cheers On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky wrote: > Can you

[Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-02 Thread Roberto Coluccio
Hi, I've been struggling with an issue ever since I tried to upgrade my Spark Streaming solution from 1.4.1 to 1.5+. I have a Spark Streaming app which creates 3 ReceiverInputDStreams leveraging the KinesisUtils.createStream API. I used to leverage a timeout to terminate my app

RE: how to introduce spark to your colleague if he has no background about *** spark related

2016-02-02 Thread Mohammed Guller
Hi Charles, You may find slides 16-20 from this deck useful: http://www.slideshare.net/mg007/big-data-trends-challenges-opportunities-57744483 I used it for a talk that I gave to MS students last week. I wanted to give them some context before describing Spark. It doesn’t cover all the stuff

Re: Error trying to get DF for Hive table stored HBase

2016-02-02 Thread Ted Yu
Looks like this is related: HIVE-12406 FYI On Tue, Feb 2, 2016 at 1:40 PM, Doug Balog wrote: > I’m trying to create a DF for an external Hive table that is in HBase. > I get a NoSuchMethodError >

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jim Green
Look at part #3 in the blog below: http://www.openkb.info/2015/06/resource-allocation-configurations-for.html You may want to increase the executor memory, not just the spark.yarn.executor.memoryOverhead. On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov wrote: > For the
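
A hedged sketch of both knobs discussed in this thread, for YARN deployments; the values are illustrative only, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "16g")                // executor JVM heap
      .set("spark.yarn.executor.memoryOverhead", "2048")  // off-heap overhead in MB; default is max(384, 10% of executor memory)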

Best way to process large number of (non-text) files in deeply nested folder hierarchy

2016-02-02 Thread Boris Capitanu
Hello, I find myself needing to process a large number of files (28M) stored in a deeply nested folder hierarchy (Pairtree, a multi-level hashtable-on-disk-like structure). Here's an example path: ./udel/pairtree_root/31/74/11/11/56/89/39/3174568939/3174568939.zip I

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
I think spark dataframe supports more than just SQL. It is more like a pandas dataframe. (I rarely use the SQL feature.) There are a lot of novelties in dataframe so I think it is quite optimized for many tasks. The in-memory data structure is very memory efficient. I just change a very slow RDD

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Michael Armbrust
> > A principal difference between RDDs and DataFrames/Datasets is that the > latter have a schema associated to them. This means that they support only > certain types (primitives, case classes and more) and that they are > uniform, whereas RDDs can contain any serializable object and must not >

recommendProductsForUser for a subset of user

2016-02-02 Thread Roberto Pagliari
When using ALS, is it possible to use recommendProductsForUser for a subset of users? Currently, productFeatures and userFeatures are val. Is there a workaround for it? Using recommendForUser repeatedly would not work in my case, since it would be too slow with many users. Thank you,
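
One possible workaround, sketched under the assumption that scoring the subset manually is acceptable and that the product factors fit in driver memory: broadcast the product factors and take dot products against only the chosen users' factors (model and userSubset are hypothetical names for the ALS MatrixFactorizationModel and the Set[Int] of user ids):

    val products = sc.broadcast(model.productFeatures.collect())
    val topN = 10
    val recs = model.userFeatures
      .filter { case (userId, _) => userSubset.contains(userId) }
      .mapValues { uf =>
        products.value
          .map { case (productId, pf) => (productId, uf.zip(pf).map { case (a, b) => a * b }.sum) }
          .sortBy(-_._2)
          .take(topN)
      }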

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
Sure, having a common distributed query and compute engine for all kinds of data sources is an alluring concept to market and advertise and to attract potential customers (non-engineers, analysts, data scientists). But it's nothing new!.. it's darn old school. It's taking bits and pieces from existing SQL

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jakob Odersky
To address one specific question: > Docs say it uses sun.misc.unsafe to convert the physical RDD structure into a byte array at some point for optimized GC and memory. My question is why is it only applicable to SQL/DataFrame and not RDD? RDD has types too! A principal difference between RDDs and
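
A small illustration of the difference being discussed, using a hypothetical case class: the RDD holds JVM objects that are opaque to Spark, while the DataFrame derived from it carries a schema that Catalyst/Tungsten can exploit:

    case class Page(id: Long, visits: Int)

    val rdd = sc.parallelize(Seq(Page(1L, 10), Page(2L, 3)))  // RDD[Page]: typed in Scala, but opaque to Spark
    val df  = sqlContext.createDataFrame(rdd)                 // DataFrame: schema (id: bigint, visits: int) known to the optimizer
    df.printSchema()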

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Michael, Is there a section in the Spark documentation that demonstrates how to serialize arbitrary objects in a DataFrame? The last time I did it, I used a User Defined Type (copied from VectorUDT). Best Regards, Jerry On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust

make-distribution fails due to wrong order of modules

2016-02-02 Thread Koert Kuipers
i am seeing make-distribution fail because lib_managed does not exist. what seems to happen is that the sql/hive module gets built and creates this directory. but some time later the spark-parent module gets built, which includes: [INFO] Building Spark Project Parent POM 1.6.0-SNAPSHOT [INFO]

how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Divya Gehlot
Hi, I would like to know how to calculate how much --executor-memory we should allocate, and how many num-executors and total-executor-cores we should give when submitting Spark jobs. Is there any formula for it? Thanks, Divya

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :) On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam wrote: > Hi Spark users and developers, > > anyone knows how to union two RDDs without the overhead of it? > > say rdd1.union(rdd2).saveTextFile(..) > This

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
i am surprised union introduces a stage. UnionRDD should have only narrow dependencies. On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers wrote: > well the "hadoop" way is to save to a/b and a/c and read from a/* :) > > On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam

Overriding toString and hashCode with Spark streaming

2016-02-02 Thread N B
Hello, In our Spark streaming application, we are forming DStreams made of objects of a rather large composite class. I have discovered that in order to do some operations like RDD.subtract(), they are only successful for complex objects such as these when overriding the toString() and hashCode() methods
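
A hedged illustration with a hypothetical, much smaller element type: subtract() depends on consistent equals()/hashCode() for the elements, which a case class provides automatically:

    case class Event(id: Long, source: String, tags: Seq[String])

    val a = sc.parallelize(Seq(Event(1, "x", Seq("a")), Event(2, "y", Seq("b"))))
    val b = sc.parallelize(Seq(Event(2, "y", Seq("b"))))
    a.subtract(b).collect()  // keeps only Event(1, "x", Seq("a")) because equality is structural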

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
Hi Jerry, Yes, I read that benchmark. And it doesn't help in most cases. I'll give you an example from one of our applications. It's a memory hogger by nature since it works on groupByKey and performs combinatorics on an Iterator, so it maintains a few structures inside the task. It works on MapReduce with half the

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra
Agree with Koert that UnionRDD should have narrow dependencies, although a union of two RDDs increases the number of tasks to be executed (rdd1.partitions + rdd2.partitions). If your two RDDs have the same number of partitions, you can also use zipPartitions, which results in fewer tasks,
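
A hedged sketch of the zipPartitions alternative mentioned above, assuming both RDDs have the same number of partitions:

    val rdd1 = sc.parallelize(1 to 100, 4)
    val rdd2 = sc.parallelize(101 to 200, 4)

    // One task per partition pair, instead of one per partition of each input.
    val merged = rdd1.zipPartitions(rdd2)((left, right) => left ++ right)
    merged.saveAsTextFile("out/combined")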

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
with respect to joins, unfortunately not all implementations are available. for example i would like to use joins where one side is streaming (and the other cached). this seems to be available for DataFrame but not for RDD. On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel

Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true. But it's not always correct for a pipeline model; some transformers in the pipeline, such as OneHotEncoder, will change the features. 2016-02-03 1:21 GMT+08:00 jmvllt : > Hi everyone, > > This may sound like a stupid question but I need to be sure of this

Re: Getting the size of a broadcast variable

2016-02-02 Thread Ted Yu
There is a chance that the log message may change in future releases, so log snooping would break. FYI On Mon, Feb 1, 2016 at 9:55 PM, Takeshi Yamamuro wrote: > Hi, > > Currently, there is no way to check the size except for snooping INFO-logs > in a driver; > > 16/02/02
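
One alternative to log snooping, sketched here as an approximation only: estimate the in-memory size of the value with Spark's SizeEstimator before broadcasting it (this estimates the deserialized object, not the serialized broadcast size):

    import org.apache.spark.util.SizeEstimator

    val table = (1 to 1000000).map(i => i -> i.toString).toMap
    println(SizeEstimator.estimate(table))  // approximate size in bytes
    val bc = sc.broadcast(table)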

RE: can we do column bind of 2 dataframes in spark R? similar to cbind in R?

2016-02-02 Thread Sun, Rui
Devesh, The cbind-like operation is not supported by the Scala DataFrame API, so it is also not supported in SparkR. You may try to work around this with the approach in http://stackoverflow.com/questions/32882529/how-to-zip-twoor-more-dataframe-in-spark You could also submit a JIRA
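
A hedged Scala sketch of the zip-based approach in that link, assuming df1 and df2 have the same number of partitions and the same number of rows per partition (which is what RDD.zip requires):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    val zipped = df1.rdd.zip(df2.rdd).map { case (r1, r2) => Row.fromSeq(r1.toSeq ++ r2.toSeq) }
    val cbound = sqlContext.createDataFrame(zipped, StructType(df1.schema ++ df2.schema))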

Spark saveAsHadoopFile stage fails with ExecutorLostfailure

2016-02-02 Thread Prabhu Joseph
Hi All, A Spark job stage with saveAsHadoopFile fails with ExecutorLostFailure whenever the executors run with more cores. The stage is not memory intensive, and each executor has 20GB of memory. For example, with 6 executors each with 6 cores, ExecutorLostFailure happens; with 10 executors each with 2 cores,

Re: Spark Streaming with Kafka - batch DStreams in memory

2016-02-02 Thread Cody Koeninger
It's possible you could (ab)use updateStateByKey or mapWithState for this. But honestly it's probably a lot more straightforward to just choose a reasonable batch size that gets you a reasonable file size for most of your keys, then use filecrush or something similar to deal with the hdfs small

Re: Guidelines for writing SPARK packages

2016-02-02 Thread Praveen Devarao
Thanks David. I am looking at extending the SparkSQL library with a custom package... hence I was looking for more details on any specific classes to extend or interfaces to implement to achieve the redirect of calls to my module (when using .format). If you have any info along these lines, do
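
A rough sketch of the Data Sources API extension point that .format(...) resolves to, under the assumption that this is the intended integration; the package, class names, and schema are hypothetical:

    package com.example.myformat

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Spark looks for a class named DefaultSource in the package passed to .format(...).
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
        new MyRelation(sqlContext)
    }

    class MyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
      override def schema: StructType = StructType(Seq(StructField("value", StringType)))
      override def buildScan(): RDD[Row] = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
    }

It would then be loaded with sqlContext.read.format("com.example.myformat").load().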

Spark Streaming:Could not compute split

2016-02-02 Thread aafri
Hi, We run Spark Streaming on YARN, and the Streaming Driver restarts very often. I don't know what the matter is. The exception is below: 16/02/01 18:55:14 ERROR scheduler.JobScheduler: Error running job streaming job 1454324113000 ms.0 org.apache.spark.SparkException: Job aborted due to stage

MLLib embedded dependencies

2016-02-02 Thread Valentin Popov
Hello everyone. I am having some trouble running word2vec and the libs… Is it possible to use Spark MLlib as an embedded library (like mllib.jar + spark-core.jar) inside a Tomcat application (it already has the hadoop libs)? By default it is one huge jar containing all dependencies, and after

RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Robert, I just use textFile. Here is the simple code: val fs3File=sc.textFile("s3n://my bucket/myfolder/") fs3File.count Do you suggest I should use sc.parallelize? Many thanks From: Robert Collich [mailto:rcoll...@gmail.com] Sent: Monday, February 01, 2016 6:54 PM To: Lin, Hao; user

[MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread jmvllt
Hi everyone, This may sound like a stupid question, but I need to be sure of this: given a dataframe composed of « n » features: f1, f2, …, fn, for each row of my dataframe I create a labeled point: val row_i = LabeledPoint(label, Vectors.dense(v1_i, v2_i, …, vn_i)) where v1_i, v2_i, …, vn_i

Re: Master failover results in running job marked as "WAITING"

2016-02-02 Thread Ted Yu
bq. Failed to connect to master XXX:7077 Is the 'XXX' above the hostname for the new master ? Thanks On Tue, Feb 2, 2016 at 1:48 AM, Anthony Tang wrote: > Hi - > > I'm running Spark 1.5.2 in standalone mode with multiple masters using > zookeeper for failover. The

Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread Benjamin Kim
Hi David, My company uses Lambda to do simple data moving and processing using Python scripts. I can see using Spark instead for the data processing would make it into a real production level platform. Does this pave the way toward replacing the need for a pre-instantiated cluster in AWS or bought

Re: how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Jia Zou
Divya, According to my recent Spark tuning experiences, optimal executor-memory size not only depends on your workload characteristics (e.g. working set size at each job stage) and input data size, but also depends on your total available memory and memory requirements of other components like
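
There is no single formula, as noted above, but a purely illustrative worked example (all numbers assumed, not taken from this thread) can make the trade-offs concrete. Suppose 10 worker nodes, each with 16 cores and 64 GB RAM, reserving 1 core and ~8 GB per node for the OS and Hadoop daemons:

    cores per executor  ~ 5                       (a common rule of thumb)
    executors per node  = (16 - 1) / 5 = 3
    num-executors       = 10 * 3 - 1 = 29         (leaving one slot for the YARN ApplicationMaster)
    executor memory     = (64 - 8) / 3 ~ 18 GB; after reserving ~10% for memoryOverhead, --executor-memory ~ 16g

The actual values still need to be tuned against the workload, as Jia points out.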

Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
Hi, I have data sets like:
Dataset 1
HeaderCol1  HeadCol2   HeadCol3
dataset 1   dataset2   dataset 3
dataset 11  dataset13  dataset 13
dataset 21  dataset22  dataset 23
Dataset 2
HeadColumn1  HeadColumn2  HeadColumn3  HeadColumn4
Tag1         Dataset1
Tag2         Dataset1

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
So the latest optimizations in the Spark 1.4 and 1.5 releases are mostly from project Tungsten. The docs say it uses sun.misc.unsafe to convert the physical RDD structure into a byte array at some point for optimized GC and memory. My question is why is it only applicable to SQL/DataFrame and not RDD? RDD

Spark 1.5.2 - are new Project Tungsten optimizations available on RDD as well?

2016-02-02 Thread Nirav Patel
Hi, I read the release notes and a few slideshares on the latest optimizations in the Spark 1.4 and 1.5 releases, part of which are optimizations from project Tungsten. The docs say it uses sun.misc.unsafe to convert the physical RDD structure into a byte array before shuffle for optimized GC and memory. My

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
Dataset will have access to some of the catalyst/tungsten optimizations while also giving you scala and types. However that is currently experimental and not yet as efficient as it could be. On Feb 2, 2016 7:50 PM, "Nirav Patel" wrote: > Sure, having a common distributed

Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Ali Tajeldin EDU
While you can construct the SQL string dynamically in scala/java/python, it would be best to use the Dataframe API for creating dynamic SQL queries. See http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details. On Feb 2, 2016, at 6:49 PM, Divya Gehlot
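
A minimal sketch of building the query dynamically with the DataFrame API; the column names and values are hypothetical, and a SQLContext named sqlContext plus a loaded DataFrame df are assumed:

    val wanted = Seq("Tag1", "Tag2")  // decided at runtime
    val filtered = df.filter(df("HeadColumn1").isin(wanted: _*)).select("HeadColumn2", "HeadColumn3")

    // If a raw SQL string is really required, it can still be assembled and executed:
    df.registerTempTable("dataset2")
    val viaSql = sqlContext.sql(
      s"SELECT HeadColumn2, HeadColumn3 FROM dataset2 WHERE HeadColumn1 IN (${wanted.map(t => s"'$t'").mkString(",")})")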

Re: Master failover results in running job marked as "WAITING"

2016-02-02 Thread Anthony Tang
Yes, it's the IP address/host. Sent from Yahoo Mail for iPad On Tuesday, February 2, 2016, 8:04 AM, Ted Yu

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
Hi Spark users and developers, does anyone know how to union two RDDs without the overhead of it? Say rdd1.union(rdd2).saveAsTextFile(..) This requires a stage to union the 2 rdds before saveAsTextFile (2 stages). Is there a way to skip the union step but have the contents of the two rdds save to the

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
I don't understand why one thinks an RDD of case objects doesn't have types (a schema). If Spark can convert an RDD to a DataFrame, that means it understood the schema. So then, from that point, why does one have to use SQL features to do further processing? If all Spark needs for optimizations is a schema, then what

question on spark.streaming.kafka.maxRetries

2016-02-02 Thread Chen Song
For the Kafka direct stream, is there a way to set the time between successive retries? From my testing, it looks like it is 200 ms. Is there any way I can increase it?

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Nirav, I'm sure you read this? https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html There is a benchmark in the article to show that dataframe "can" outperform RDD implementation by 2 times. Of course, benchmarks can be "made". But from the

Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
Hi, Does Spark support dynamic SQL? I would really appreciate the help if anyone could share some references/examples. Thanks, Divya