Re: Aggregator (Spark 2.0) skips aggregation if zero(0) returns null

2016-07-01 Thread Koert Kuipers
Valid functions can be written for reduce and merge when the zero is null, so not being able to provide null as the initial value is somewhat troublesome. I guess the proper way to do this is to use Option and have None be the zero, which is what I assumed you did? Unfortunately, last time I
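
A minimal sketch of that Option-as-zero approach (Spark 2.0 Aggregator API; the buffer type, sum logic and Kryo buffer encoder are illustrative choices, not the poster's code):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // None plays the role of the "null" zero; reduce/merge handle it explicitly.
    object SumWithOptionZero extends Aggregator[Long, Option[Long], Long] {
      def zero: Option[Long] = None
      def reduce(buf: Option[Long], a: Long): Option[Long] = Some(buf.getOrElse(0L) + a)
      def merge(b1: Option[Long], b2: Option[Long]): Option[Long] = (b1 ++ b2).reduceOption(_ + _)
      def finish(buf: Option[Long]): Long = buf.getOrElse(0L)
      // Kryo keeps the sketch short; a product-typed buffer would encode more efficiently.
      def bufferEncoder: Encoder[Option[Long]] = Encoders.kryo[Option[Long]]
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }
    // usage: ds.select(SumWithOptionZero.toColumn)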

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
I found the JIRA for the issue. Will there be a fix in the future, or no fix? https://issues.apache.org/jira/browse/SPARK-6221 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27267.html

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Hi Neelesh, As I said in my earlier emails, it's not a Spark/Scala application; I am working only in Spark SQL. I am launching the spark-sql shell and running my Hive code inside the Spark SQL shell. The Spark SQL shell accepts functions which relate to Spark SQL but doesn't accept functions like coalesce, which is

Re: spark parquet too many small files ?

2016-07-01 Thread nsalian
Hi Sri, Thanks for the question. You can simply start by doing this in the initial stage: val sqlContext = new SQLContext(sc) val customerList = sqlContext.read.json(args(0)).coalesce(20) //using a json example here where the argument is the path to the file(s). This will reduce the partitions.
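
A short sketch of the same idea applied to the small-files problem in the thread below: coalescing before the write bounds the number of part files (paths and the partition count are examples):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Read, shrink to ~20 partitions, then write; the Parquet output
    // directory ends up with roughly 20 part files.
    val customers = sqlContext.read.json("/data/customers.json").coalesce(20)
    customers.write.parquet("/data/customers_parquet")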

spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Hi All, I am running Hive in spark-sql in yarn-client mode. The SQL is pretty simple: load dynamic partitions into a target Parquet table. I used Hive configuration parameters such as (set hive.merge.smallfiles.avgsize=25600; set hive.merge.size.per.task=256000;) which usually merge small

Enforcing shuffle hash join

2016-07-01 Thread Lalitha MV
Hi, In order to force a broadcast hash join, we can set the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce a shuffle hash join in Spark SQL? Thanks, Lalitha
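
For reference, disabling broadcast joins is done by setting the threshold to -1; whether the planner then chooses a shuffled hash join or a sort-merge join depends on the version and, in Spark 2.0, on the spark.sql.join.preferSortMergeJoin flag (treat that flag as something to verify against your build). A rough sketch:

    // Disable broadcast hash joins entirely.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
    // Assumed Spark 2.0 flag: let the planner consider a shuffled hash join
    // instead of always preferring sort-merge join.
    sqlContext.setConf("spark.sql.join.preferSortMergeJoin", "false")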

Re: Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread Charles Allen
I'm having the same difficulty porting https://github.com/metamx/druid-spark-batch/tree/spark2 over to Spark 2.x, where I have to go track down who is pulling in bad Jackson versions. On Fri, Jul 1, 2016 at 11:59 AM Sean Owen wrote: > Are you just asking why you can't use

Re: Best way to merge final output part files created by Spark job

2016-07-01 Thread kali.tumm...@gmail.com
Try using the coalesce function to repartition to the desired number of partition files. To merge already-written output files, use Hive and INSERT OVERWRITE TABLE with the options below: set hive.merge.smallfiles.avgsize=256; set hive.merge.size.per.task=256; set

Re: output part files max size

2016-07-01 Thread kali.tumm...@gmail.com
I am not sure, but you can use the coalesce function to reduce the number of output files. Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/output-part-files-max-size-tp17013p27262.html

Re: Spark driver assigning splits to incorrect workers

2016-07-01 Thread Ted Yu
I guess you extended some InputFormat to provide locality information. Can you share a code snippet? Which non-distributed file system are you using? Thanks On Fri, Jul 1, 2016 at 2:46 PM, Raajen wrote: > I would like to use Spark on a non-distributed file system

Spark driver assigning splits to incorrect workers

2016-07-01 Thread Raajen
I would like to use Spark on a non-distributed file system but am having trouble getting the driver to assign tasks to the workers that are local to the files. I have extended InputSplit to create my own version of FileSplit, so that each worker gets a bit more information than the default
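
A hypothetical sketch of that approach (Hadoop new-API input format; hostForPath is an assumed helper that maps each file to the machine holding it, not something from the original post):

    import java.util.{ArrayList => JArrayList, List => JList}
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext}
    import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}

    class LocalityAwareTextInputFormat extends TextInputFormat {
      // Assumed mapping from a file path to the host that stores it.
      private def hostForPath(p: Path): String = "worker-1"

      override def getSplits(context: JobContext): JList[InputSplit] = {
        val splits = new JArrayList[InputSplit]()
        val it = super.getSplits(context).iterator()
        while (it.hasNext) {
          val fs = it.next().asInstanceOf[FileSplit]
          // Re-wrap each split with an explicit host so getLocations()
          // reports the worker that actually holds the file.
          splits.add(new FileSplit(fs.getPath, fs.getStart, fs.getLength,
            Array(hostForPath(fs.getPath))))
        }
        splits
      }
    }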

Re: Aggregator (Spark 2.0) skips aggregation if zero(0) returns null

2016-07-01 Thread Amit Sela
Thanks for pointing that out, Koert! I understand now why zero() and not init(a: IN), though I still don't see a good reason to skip the aggregation if zero returns null. If the user did it, it's on him to take care of null cases in reduce/merge, but it opens up the possibility to use the input to

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
What about yarn-cluster mode? 2016-07-01 11:24 GMT-07:00 Egor Pahomov : > Separate bad users with bad queries from good users with good queries. Spark > does not provide any scope separation out of the box. > > 2016-07-01 11:12 GMT-07:00 Jeff Zhang : > >> I

Re: Deploying ML Pipeline Model

2016-07-01 Thread Saurabh Sardeshpande
Hi Nick, Thanks for the answer. Do you think an implementation like the one in this article is infeasible in production for, say, hundreds of queries per minute? https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2. The article uses Flask to

Re: Deploying ML Pipeline Model

2016-07-01 Thread Sean Owen
(The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use JPMML in Spark and couldn't otherwise because the Affero license is not Apache compatible.) On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath wrote: > I believe open-scoring is one of the well-known PMML

Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
I believe open-scoring is one of the well-known PMML serving frameworks in Java land (https://github.com/jpmml/openscoring). One can also use the raw https://github.com/jpmml/jpmml-evaluator for embedding in apps. (Note the license on both of these is AGPL - the older version of JPMML used to be

Re: Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread Sean Owen
Are you just asking why you can't use 2.5.3 in your app? Because Jackson isn't shaded, which is sort of the bad news. But just use 2.6.5 too, ideally. I don't know where 2.6.1 is coming from, but Spark doesn't use it. On Fri, Jul 1, 2016 at 5:48 PM, wrote: > In my project I

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
Separate bad users with bad queries from good users with good queries. Spark does not provide any scope separation out of the box. 2016-07-01 11:12 GMT-07:00 Jeff Zhang : > I think so, any reason you want to deploy multiple thrift servers on one > machine ? > > On Fri, Jul 1, 2016 at

Re: Deploying ML Pipeline Model

2016-07-01 Thread Jacek Laskowski
Hi Nick, Thanks a lot for the exhaustive and prompt response! (In the meantime I watched a video about PMML to get a better understanding of the topic). What are the tools that could "consume" PMML exports (after running JPMML)? What tools would be the endpoint to deliver low-latency predictions

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
I think so. Any reason you want to deploy multiple thrift servers on one machine? On Fri, Jul 1, 2016 at 10:59 AM, Egor Pahomov wrote: > Takeshi, of course I used different HIVE_SERVER2_THRIFT_PORT > Jeff, thanks, I would try, but from your answer I'm getting the

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
Takeshi, of course I used a different HIVE_SERVER2_THRIFT_PORT. Jeff, thanks, I will try, but from your answer I'm getting the feeling that I'm trying some very rare case? 2016-07-01 10:54 GMT-07:00 Jeff Zhang : > This is not a bug, because these 2 processes use the same

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
This is not a bug, because these 2 processes use the same SPARK_PID_DIR, which is /tmp by default. Although you can resolve this by using a different SPARK_PID_DIR, I suspect you would still have other issues like port conflicts. I would suggest deploying one Spark thrift server per machine for

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Takeshi Yamamuro
As said earlier, how about changing the bound port by setting the env variable `HIVE_SERVER2_THRIFT_PORT`? // maropu On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov wrote: > I get > > "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as > process 28989. Stop it first." >

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
I get "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 28989. Stop it first." Is it a bug? 2016-07-01 10:10 GMT-07:00 Jeff Zhang : > I don't think the one instance per machine is true. As long as you > resolve the conflict issue such as port

Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
Generally there are 2 ways to use a trained pipeline model - (offline) batch scoring, and real-time online scoring. For batch (or even "mini-batch" e.g. on Spark streaming data), then yes certainly loading the model back in Spark and feeding new data through the pipeline for prediction works just
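
For the batch-scoring path, a minimal sketch using Spark 2.0 ML persistence (paths and column names are placeholders; assumes the pipeline was previously saved with model.write.save(...)):

    import org.apache.spark.ml.PipelineModel

    // `spark` is the SparkSession available in a Spark 2.0 shell.
    val model = PipelineModel.load("/models/my-pipeline")
    val newData = spark.read.parquet("/data/new-events")
    val scored = model.transform(newData)              // adds prediction columns
    scored.select("id", "prediction").write.parquet("/data/scored")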

Re: Deploying ML Pipeline Model

2016-07-01 Thread Jacek Laskowski
Hi Rishabh, Just today I had a similar conversation about how to do an ML Pipeline deployment, and I couldn't really answer it, largely because I don't really understand the use case. What would you expect from ML Pipeline model deployment? You can save your model to a file by

Re: How are threads created in SQL Executor?

2016-07-01 Thread Takeshi Yamamuro
You mean `spark.sql.shuffle.partitions`? http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options // maropu On Fri, Jul 1, 2016 at 8:42 AM, emiretsk wrote: > It seems like threads are created by SQLExecution.withExecutionId, which is

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
I don't think the one-instance-per-machine limit is true. As long as you resolve conflicts such as the port, pid file, log file, etc., you can run multiple instances of the Spark thrift server. On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov wrote: > Hi, I'm using

Re: Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread ppatierno
In my project I found the library that brings in Jackson core 2.6.5, which is used in conjunction with the Jackson Scala module 2.5.3 wanted by Spark 2.0.0-preview. At runtime it's the cause of the exception. Now I have excluded 2.6.5 using sbt, but it could be dangerous for the other
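
One way to avoid excluding the transitive dependency is to pin the Jackson artifacts instead; a build.sbt sketch, assuming Spark 2.0.0-preview expects the 2.6.x line (verify the exact versions against the Spark POM you build against):

    dependencyOverrides ++= Set(
      "com.fasterxml.jackson.core"    % "jackson-core"         % "2.6.5",
      "com.fasterxml.jackson.core"    % "jackson-databind"     % "2.6.5",
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.5"
    )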

Cluster mode deployment from jar in S3

2016-07-01 Thread Ashic Mahtab
Hello, I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs using "--deploy-mode client", however using "--deploy-mode cluster" is proving to be a challenge. I've tried this: spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster

Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
Hi, I'm using the Spark Thrift JDBC server and 2 limitations really bother me - 1) one instance per machine and 2) yarn-client only (not yarn-cluster). Are there any architectural reasons for such limitations? About yarn-client I might understand in theory - the master is the same process as the server, so

Re: Random Forest Classification

2016-07-01 Thread Rich Tarro
Hi Bryan. Thanks for your continued help. Here is the code shown in a Jupyter notebook. I figured this was easier than cutting and pasting the code into an email. If you would like me to send you the code in a different format, let me know. The necessary data is all downloaded within the

How are threads created in SQL Executor?

2016-07-01 Thread emiretsk
It seems like threads are created by SQLExecution.withExecutionId, which is called inside BroadcastExchangeExec.scala. When does the plan executor execute a BroadcastExchange, and is there a way to control the number of threads? We have a job that writes DataFrames to an external DB, and it seems

Re: Deploying ML Pipeline Model

2016-07-01 Thread Silvio Fiorito
Hi Rishabh, My colleague, Richard Garris from Databricks, actually just gave a talk last night at the Bay Area Spark Meetup on ML model deployment. The slides and recording should be up soon, you should be able to find a link here: http://www.meetup.com/spark-users/events/231574440/ Thanks,

Re: HiveContext

2016-07-01 Thread Mich Talebzadeh
Hi, In general, if your ORC table is not bucketed it is not going to do much. The idea is that, using predicate pushdown, you will only get the data from the partition concerned and avoid expensive table scans! ORC provides what is known as a storage index at the file, stripe and rowset levels (default
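
In Spark of this era the ORC filter pushdown flag is off by default, so it has to be enabled before the storage index can prune anything; a small sketch (table and column names are examples):

    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
    val under15 = sqlContext.sql("SELECT * FROM people_orc WHERE age < 15")
    under15.explain(true)  // inspect the plan to confirm the filter is pushed down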

Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread Paolo Patierno
Hi, developing a custom receiver, up to today I used Spark version "2.0.0-SNAPSHOT" and Scala version 2.11.7. With these versions all tests work fine. I have just switched to "2.0.0-preview" as the Spark version, but now I have the following error: An exception or error caused a run to abort: class

Re: Deploying ML Pipeline Model

2016-07-01 Thread Steve Goodman
Hi Rishabh, I have a similar use-case and have struggled to find the best solution. As I understand it 1.6 provides pipeline persistence in Scala, and that will be expanded in 2.x. This project https://github.com/jpmml/jpmml-sparkml claims to support about a dozen pipeline transformers, and 6 or

Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Dood
On 7/1/2016 6:42 AM, Akhil Das wrote: case class Holder(str: String, js: JsValue) Hello, Thanks! I tried that before posting the question to the list, but I keep getting an error such as this even after the map() operation to convert (String, JsValue) -> Holder and then toDF(). I am simply

Deploying ML Pipeline Model

2016-07-01 Thread Rishabh Bhardwaj
Hi All, I am looking for ways to deploy an ML Pipeline model in production. Spark has already proved to be one of the best frameworks for model training and creation, but once the ML pipeline model is ready, how can I deploy it outside the Spark context? The MLlib model has a toPMML method, but today

Re: JavaStreamingContext.stop() hangs

2016-07-01 Thread chandan prakash
http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html On Fri, Jul 1, 2016 at 1:42 PM, manoop wrote: > I have a Spark job and I just want to stop it on some condition. Once the > condition is met, I am calling JavaStreamingContext.stop(),

Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
Can you try the Cassandra connector 1.5? It is also compatible with Spark 1.6 according to their documentation https://github.com/datastax/spark-cassandra-connector#version-compatibility You can also crosspost it over here

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-01 Thread Akhil Das
You can use this https://github.com/wurstmeister/kafka-docker to spin up a Kafka cluster and then point your Spark Streaming job at it to consume from it. On Fri, Jul 1, 2016 at 1:19 AM, SRK wrote: > Hi, > > I need to do integration tests using Spark Streaming. My idea is
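
A rough sketch of the consuming side, assuming Spark 1.6 with the Kafka 0.8 direct-stream artifact (broker address and topic are placeholders for whatever the docker container exposes):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("test-topic"))
    stream.map(_._2).print()   // just print message values in the test
    ssc.start()
    ssc.awaitTermination()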

RE: Remote RPC client disassociated

2016-07-01 Thread Joaquin Alzola
Hi Akhil, I am using: Cassandra: 3.0.5 Spark: 1.6.1 Scala 2.10 Spark-Cassandra connector: 1.6.0 From: Akhil Das [mailto:ak...@hacked.work] Sent: 01 July 2016 11:38 To: Joaquin Alzola Cc: user@spark.apache.org Subject: Re: Remote RPC client disassociated This looks

Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Akhil Das
Something like this? import sqlContext.implicits._ case class Holder(str: String, js:JsValue) yourRDD.map(x => Holder(x._1, x._2)).toDF() On Fri, Jul 1, 2016 at 3:36 AM, Dood@ODDO wrote: > Hello, > > I have an RDD[(String,JsValue)] that I want to convert into a
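
The error reported further up in this thread is typically because Spark SQL cannot derive a schema for an arbitrary JsValue. A sketch of one workaround, assuming play-json: store the JSON as a String column and parse it back when needed.

    import play.api.libs.json.{JsValue, Json}
    import sqlContext.implicits._

    case class Holder(str: String, js: String)

    val df = yourRDD.map { case (s, jsValue) => Holder(s, jsValue.toString) }.toDF()
    df.printSchema()   // str: string, js: string
    // Json.parse(row.getAs[String]("js")) recovers the JsValue later.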

Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
This looks like a version conflict. Which version of Spark are you using? The Cassandra connector you are using is for Scala 2.10.x and Spark 1.6. On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola wrote: > HI List, > > > > I am launching this spark-submit job: >

HiveContext

2016-07-01 Thread manish jaiswal
Hi, Using sparkHiveContext, we end up reading all rows where age is between 0 and 100, even though we requested only rows where age is less than 15. Such full table scanning is an expensive operation. ORC avoids this type of overhead by using predicate push-down with three levels of built-in indexes

Re: How does Spark make partitions when we insert data using a SQL query, and how are the permissions on the partitions assigned?

2016-07-01 Thread Mich Talebzadeh
Let us take this for a ride. Simple code: it reads from an existing table of 22 million rows stored as ORC and saves it as Parquet. val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) HiveContext.sql("use oraclehadoop") val s = HiveContext.table("sales2") val sorted =
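
The snippet above is cut off, so here is a self-contained sketch of the same read-ORC / write-Parquet pattern rather than the author's exact code (table, column and path names are examples):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("use oraclehadoop")
    val s = hiveContext.table("sales2")
    val sorted = s.orderBy("time_id")
    sorted.write.mode("overwrite").parquet("/user/hive/warehouse/sales2_parquet")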

JavaStreamingContext.stop() hangs

2016-07-01 Thread manoop
I have a Spark job and I just want to stop it on some condition. Once the condition is met, I am calling JavaStreamingContext.stop(), but it just hangs. It does not move on to the next line, which is just a debug line. I expect it to come out. I already tried different variants of stop, that is,
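
For reference, the stop variants take two booleans (the Java API mirrors the Scala one shown here). A graceful stop waits for in-flight batches to finish, which can look like a hang if stop() is called from code running inside a batch; calling it from a separate thread or passing stopGracefully = false is the usual workaround. A sketch:

    // Stop the streaming context and the underlying SparkContext without
    // waiting for in-flight batches to complete.
    ssc.stop(stopSparkContext = true, stopGracefully = false)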

How does Spark make partitions when we insert data using a SQL query, and how are the permissions on the partitions assigned?

2016-07-01 Thread shiv4nsh
Hey guys, I am using Apache Spark 1.5.2, and I am running a SQL query using the SQLContext; when I run the insert query it saves the data in partitions (as expected). I am just curious and want to know how these partitions are made and how the permissions on these partitions are assigned. Can