Re: Spark 1.5 Streaming and Kinesis

2015-10-20 Thread Jean-Baptiste Onofré
Hi Phil, thanks for the Jira, I will try to take a look asap. Regards JB On 10/19/2015 11:07 PM, Phil Kallos wrote: I am currently trying a few code changes to see if I can squash this error. I have created https://issues.apache.org/jira/browse/SPARK-11193 to track progress, hope that is

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Deenar Toraskar
This seems to be set using hive.exec.scratchdir, is that set? hdfsSessionPath = new Path(hdfsScratchDirURIString, sessionId); createPath(conf, hdfsSessionPath, scratchDirPermission, false, true); conf.set(HDFS_SESSION_PATH_KEY, hdfsSessionPath.toUri().toString()); On 20 October 2015
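A minimal sketch of checking and overriding that setting from a HiveContext (the scratch path below is only a hypothetical example, not a recommendation):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // see what the session is currently using; "<not set>" if nothing is configured
    println(hiveContext.getConf("hive.exec.scratchdir", "<not set>"))
    // point the scratch dir at a location the Spark user can write to (example path)
    hiveContext.setConf("hive.exec.scratchdir", "/tmp/hive-scratch")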

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Davies Liu
The thread-local things does not work well with PySpark, because the thread used by PySpark in JVM could change over time, SessionState could be lost. This should be fixed in master by https://github.com/apache/spark/pull/8909 On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote:

Re: can I use Spark as alternative for gem fire cache ?

2015-10-20 Thread Deenar Toraskar
Kali >> can I cache a RDD in memory for a whole day ? as far as I know the RDD will get empty once the spark code finishes executing (correct me if I am wrong). Spark can definitely be used as a replacement for in memory databases for certain use cases. Spark RDDs are not shared amongst contexts. You
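A minimal sketch of keeping an RDD cached for reuse, assuming a single long-running application whose SparkContext stays alive (cached RDDs are not visible to other contexts; the path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val cached = sc.textFile("hdfs:///data/daily_feed")
      .persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()  // materializes the cache once
    // later jobs in the same application reuse the cached copy
    cached.filter(_.contains("ERROR")).count()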

Re: difference between rdd.collect().toMap to rdd.collectAsMap() ?

2015-10-20 Thread Adrian Tanase
If you look at the source code you’ll see that this is merely a convenience function on PairRDDs - only interesting detail is that it uses a mutable HashMap to optimize creating maps with many keys. That being said, .collect() is called anyway.
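A small sketch of the two calls side by side; both pull the whole RDD to the driver, and duplicate keys keep only one value:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val viaCollectAsMap = pairs.collectAsMap()   // built with a mutable HashMap internally
    val viaToMap        = pairs.collect().toMap  // same content as an immutable Map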

Is there a way to create multiple streams in spark streaming?

2015-10-20 Thread LinQili
Hi all, I wonder if there is a way to create some child streams while using Spark Streaming? For example, I create a netcat main stream, read data from a socket, then create 3 different child streams on the main stream; in stream1, we do fun1 on the input data then print the result to the screen; in

difference between rdd.collect().toMap to rdd.collectAsMap() ?

2015-10-20 Thread kali.tumm...@gmail.com
Hi All, Is there any performance impact when I use collectAsMap on my RDD instead of rdd.collect().toMap ? I have a key value RDD and I want to convert it to a HashMap. As far as I know collect() is not efficient on large data sets as it runs on the driver. Can I use collectAsMap instead? Is there any

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
One region -Original Message- From: "Ted Yu" Sent: 20-10-2015 15:01 To: "Amit Singh Hora" Cc: "user" Subject: Re: Spark opening to many connection with zookeeper How many regions do your table have ? Which hbase

Re: Is there a way to create multiple streams in spark streaming?

2015-10-20 Thread Gerard Maas
You can create as many functional derivates of your original stream by using transformations. That's exactly the model that Spark Streaming offers. In your example, that would become something like: val stream = ssc.socketTextStream("localhost", ) val stream1 = stream.map(fun1) val stream2 =
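A fuller sketch of that pattern, with a hypothetical port and trivial stand-ins for the fun1/fun2/fun3 logic mentioned above:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // stand-ins for the real per-stream logic
    def fun1(s: String): String = s.toUpperCase
    def fun2(s: String): String = s.reverse
    def fun3(s: String): Boolean = s.nonEmpty

    val ssc = new StreamingContext(sc, Seconds(1))
    val stream  = ssc.socketTextStream("localhost", 9999)  // 9999 is an example port
    val stream1 = stream.map(fun1)
    val stream2 = stream.map(fun2)
    val stream3 = stream.filter(fun3)
    stream1.print()                                    // print to screen
    stream2.saveAsTextFiles("hdfs:///out/stream2")     // save to storage (example path)
    stream3.foreachRDD(rdd => println(rdd.count()))    // arbitrary per-batch action
    ssc.start()
    ssc.awaitTermination()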

[spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Ayoub
Hello, when upgrading to spark 1.5.1 from 1.4.1 the following code crashed at runtime. It is mainly used to parse HiveQL queries and check that they are valid. package org.apache.spark.sql.hive val sql = "CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo` INT) PARTITIONED BY (year INT,

Re: Spark opening to many connection with zookeeper

2015-10-20 Thread Ted Yu
How many regions do your table have ? Which hbase release do you use ? Cheers On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora wrote: > Hi All , > > My spark job started reporting zookeeper errors after seeing the zkdumps > from Hbase master i realized that there are N

Re: flattening a JSON data structure

2015-10-20 Thread Nuno Carvalho
Hi Michael, thank you for your reply. I tried to use explode before, but didn't have much success, and I didn't find an example where it is used like you suggested. I applied your suggestion and it works like a charm! Thanks, Nuno Nuno Carvalho Software Engineer - Big Data & Analytics On 19

Re: Spark 1.5 Streaming and Kinesis

2015-10-20 Thread Jean-Baptiste Onofré
Hi Phil, did you see my comments in the Jira ? Can you provide an update in the Jira please ? Thanks ! Regards JB On 10/19/2015 11:07 PM, Phil Kallos wrote: I am currently trying a few code changes to see if I can squash this error. I have created

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
As Francois pointed out, you are encountering a classic small file anti-pattern. One solution I used in the past is to wrap all these small binary files into a sequence file or avro file. For example, the avro schema can have two fields: filename: string and binaryname:byte[]. Thus your file is
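A minimal sketch of that packing step, using a SequenceFile of (filename, bytes) pairs rather than Avro; the HDFS paths are hypothetical:

    // pack many small binary files into one SequenceFile
    val packed = sc.binaryFiles("hdfs:///input/small_files")
      .mapValues(_.toArray())  // PortableDataStream -> Array[Byte]
    packed.saveAsSequenceFile("hdfs:///input/packed")

    // read the packed copy back as (path, bytes) pairs
    val restored = sc.sequenceFile[String, Array[Byte]]("hdfs:///input/packed")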

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Sean Owen
coalesce without a shuffle? it shouldn't be an action. It just treats many partitions as one. On Tue, Oct 20, 2015 at 1:00 PM, t3l wrote: > > I have dataset consisting of 5 binary files (each between 500kb and > 2MB). They are stored in HDFS on a Hadoop cluster. The

Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread t3l
I have a dataset consisting of 5 binary files (each between 500kb and 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the cluster are also the workers for Spark. I open the files as an RDD using sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action that

Concurrency issue in Streams of data

2015-10-20 Thread Priya Ch
Hi All, When processing streams of data (with batch interval 1 sec), there is a possible concurrency issue, i.e. two messages M1 and M2 (an updated version of M1) with the same key are processed by 2 threads in parallel. To resolve this concurrency issue, I am applying a Hash Partitioner on the RDD.

spark straggle task

2015-10-20 Thread Triones,Deng(vip.com)
Hi All We run an application with version 1.4.1 in standalone mode. We saw two tasks in one stage which run very slowly and seem to hang. We know that the JobScheduler has the function to assign a straggler task to another node. But from what we saw it does not reassign. So we want to know is there

Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Singh Hora
Hi All , My spark job started reporting zookeeper errors. After seeing the zkdumps from the Hbase master I realized that there are N number of connections being made from the nodes where the workers of spark are running. I believe somehow the connections are not getting closed and that is leading to the error

Re: JdbcRDD Constructor

2015-10-20 Thread satish chandra j
Hi Deenar, Thanks for your valuable inputs. Here is the situation: if a Source Table does not have any such column (unique values, numeric and sequential) which is suitable as the Partition Column to be specified for the JDBCRDD Constructor or DataSource API, how to proceed further in this scenario, and also

hbase refguide URL

2015-10-20 Thread Ted Yu
Hi, I couldn't access the following URL (404): http://hbase.apache.org/book.html The above is linked from http://hbase.apache.org Where can I find the refguide ? Thanks

Re: Spark opening to many connection with zookeeper

2015-10-20 Thread Ted Yu
Can you take a look at example 37 on page 225 of: http://hbase.apache.org/apache_hbase_reference_guide.pdf You can use the following method of Table: void put(List puts) throws IOException; After the put() returns, the connection is closed. Cheers On Tue, Oct 20, 2015 at 2:40 AM, Amit Hora
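A sketch of that approach from Spark, assuming an RDD pairs of (rowKey, value) strings and hypothetical table/column-family names; one connection is opened and closed per partition, with a single batched put:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    pairs.foreachPartition { partition =>
      val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = connection.getTable(TableName.valueOf("my_table"))
      val puts = partition.map { case (rowKey, value) =>
        val p = new Put(Bytes.toBytes(rowKey))
        p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        p
      }.toList
      table.put(puts.asJava)  // one batched put per partition
      table.close()
      connection.close()
    }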

Re: How to change the compression format when using SequenceFileOutputFormat with Spark

2015-10-20 Thread ๏̯͡๏
Figured it out. The fix: change the compression codec and set LD_LIBRARY_PATH. $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apache/hadoop/lib/native/ -sh-4.1$ ./bin/spark-shell import org.apache.hadoop.io.Text import org.codehaus.jackson.map.ObjectMapper import scala.collection.JavaConversions._
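For the codec part, a minimal sketch (assuming the (Text, Text) records RDD from the original post and a hypothetical output path) is to pass the codec class directly when saving:

    import org.apache.hadoop.io.compress.GzipCodec

    records.map { case (k, v) => (k.toString, v.toString) }
      .saveAsSequenceFile("hdfs:///out/dw_sites_filtered", Some(classOf[GzipCodec]))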

hive thriftserver and fair scheduling

2015-10-20 Thread Sadhan Sood
Hi All, Does anyone have fair scheduling working for them in a hive server? I have one hive thriftserver running and multiple users trying to run queries at the same time on that server using a beeline client. I see that a big query is stopping all other queries from making any progress. Is this

Re: hive thriftserver and fair scheduling

2015-10-20 Thread Michael Armbrust
Not the most obvious place in the docs... but this is probably helpful: https://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling You likely want to put each user in their own pool. On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood wrote: > Hi All, > > Does
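A minimal sketch of putting a session's statements into its own pool; the pool name and query are hypothetical, the pool must exist in fairscheduler.xml, and spark.scheduler.mode must be FAIR:

    sc.setLocalProperty("spark.scheduler.pool", "user_a_pool")
    sqlContext.sql("SELECT count(*) FROM some_table").show()  // runs in user_a_pool
    sc.setLocalProperty("spark.scheduler.pool", null)          // back to the default pool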

Using spark in cluster mode

2015-10-20 Thread masoom alam
Dear all I want to set up spark in cluster mode. The problem is that each worker node is looking for a file to process in its local directory. Is it possible to set up something like HDFS so that each worker node takes its part of a file from HDFS? Any good tutorials for this? Thanks

multiple pyspark instances simultaneously (same time)

2015-10-20 Thread jeff.sadow...@gmail.com
I am having issues trying to set up spark to run jobs simultaneously. I thought I wanted FAIR scheduling? I used the templated fairscheduler.xml as is. When I start pyspark I see the 3 expected pools: production, test, and default. When I log in as a second user and run pyspark I see the

Incremental load of RDD from HDFS?

2015-10-20 Thread Chris Spagnoli
I am new to Spark, and this user community, so my apologies if this was answered elsewhere and I missed it (I did try search first). We have multiple large RDDs stored across a HDFS via Spark (by calling pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a given RDD (by

multiple pyspark instances simultaneously (same time)

2015-10-20 Thread Jeff Sadowski
I am having issues trying to set up spark to run jobs simultaneously. I thought I wanted FAIR scheduling? I used the templated fairscheduler.xml as is. When I start pyspark I see the 3 expected pools: production, test, and default. When I log in as a second user and run pyspark I see the expected

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
For compatibility reasons, we always write data out as nullable in parquet. Given that that bit is only an optimization that we don't actually make much use of, I'm curious why you are worried that its changing to true? On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam wrote: >

How to change the compression format when using SequenceFileOutputFormat with Spark

2015-10-20 Thread ๏̯͡๏
My Code: val dwsite = sc.sequenceFile("/sys/edw/dw_sites/snapshot/2015/10/18/00/part-r-0",classOf[Text], classOf[Text]) val records = dwsite.filter { case (k, v) => if(v.toString.indexOf("Bhutan") != -1) true else false }

Preemption with Spark on Yarn

2015-10-20 Thread surbhi.mungre
Hi All, I am new to Spark and I am trying to understand how preemption works with Spark on Yarn. My goal is to determine amount of re-work a Spark application has to do if an executor is preempted. For my test, I am using a 4 node cluster with Cloudera VM running Spark 1.3.0. I am running

Re: Multiple joins in Spark

2015-10-20 Thread Xiao Li
Are you using hiveContext? First, build your Spark using the following command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package Then, try this sample program object SimpleApp { case class Individual(name: String, surname: String, birthDate:

Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Xiao Li
Just curious why you are using parseSql APIs? It works well if you use the external APIs. For example, in your case: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo` INT) PARTITIONED BY (year INT, month

RE: Hive custom transform scripts in Spark?

2015-10-20 Thread Yang Wu (Tata Consultancy Services)
Yes. We are trying to run a custom script written in C# using TRANSFORM, but cannot get it to work. The query and error are below. Any suggestions? Thank you! Spark version: 1.3 Here is how we add and invoke the script: scala> hiveContext.sql("""ADD FILE wasb://… /NSSGraphHelper.exe""")

Re: Hive custom transform scripts in Spark?

2015-10-20 Thread Michael Armbrust
Yeah, I don't think this feature was designed to work on systems that don't have bash. You could open a JIRA. On Tue, Oct 20, 2015 at 10:36 AM, Yang Wu (Tata Consultancy Services) < v-wuy...@microsoft.com> wrote: > Yes. > > We are trying to run a custom script written in C# using TRANSFORM, but

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Yin Huai
btw, what version of Spark did you use? On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote: > I've connected Spark SQL to the Hive Metastore and currently I'm running > SQL > code via pyspark. Typically everything works fine, but sometimes after a > long-running Spark SQL job I

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Let me share my 2 cents. First, this is not documented in the official document. Maybe we should do it? http://spark.apache.org/docs/latest/sql-programming-guide.html Second, nullability is a significant concept for database people. It is part of the schema. Extra code is needed for evaluating

Re: Incremental load of RDD from HDFS?

2015-10-20 Thread Ali Tajeldin EDU
I could be misreading the code, but looking at the code for toLocalIterator (copied below), it should lazily call runJob on each partition in your input. It shouldn't be parsing the entire RDD before returning from the first "next" call. If it is taking a long time on the first "next" call, it
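A small sketch of that behaviour; nothing runs until the iterator is consumed, and consuming only the head should launch a job for the first partition only:

    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    val it  = rdd.toLocalIterator      // lazy: no job yet
    println(it.take(5).toList)         // computes just the first partition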

Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
Hi Spark users and developers, I have a dataframe with the following schema (Spark 1.5.1): StructType(StructField(type,StringType,true), StructField(timestamp,LongType,false)) After I save the dataframe in parquet and read it back, I get the following schema:
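If the original nullability matters downstream, one workaround (a sketch, assuming the two fields named above and a hypothetical path) is to re-assert the schema after reading the parquet file back:

    import org.apache.spark.sql.types._

    val expected = StructType(Seq(
      StructField("type", StringType, nullable = true),
      StructField("timestamp", LongType, nullable = false)))
    val reloaded = sqlContext.read.parquet("hdfs:///out/events")
    val restored = sqlContext.createDataFrame(reloaded.rdd, expected)
    restored.printSchema()  // timestamp shows as non-nullable again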

Re: Top 10 count

2015-10-20 Thread Carol McDonald
// sort by 2nd element Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2)) // sort by the 3rd element, then 1st Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1))) On Tue, Oct 20, 2015 at 11:33 AM, Carol McDonald wrote: > this works

Re: mailing list subscription

2015-10-20 Thread Sean Owen
Nabble is not related to the mailing list, so I think that's the problem. spark.apache.org -> Community -> Mailing Lists: http://spark.apache.org/community.html These instructions from the project website itself are correct. On Tue, Oct 20, 2015 at 5:22 PM, Jeff Sadowski

Re: Top 10 count

2015-10-20 Thread Carol McDonald
this works val top10 = logs.filter(log => log.responseCode != 200).map(log => (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering.by(_._2)) or val top10 = logs.filter(log => log.responseCode != 200).map(log => (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering.by(_._2)) On Tue, Oct 20,

Re: Spark opening to many connection with zookeeper

2015-10-20 Thread Ted Yu
I need to dig deeper into saveAsHadoopDataset to see what might have caused the effect you observed. Cheers On Tue, Oct 20, 2015 at 8:57 AM, Amit Hora wrote: > Hi Ted, > > I made mistake last time yes the connection are very controlled when I > used put like iterated over

Re: Spark: How to find similar text title

2015-10-20 Thread Sonal Goyal
Do you want to compare within the rdd or do you have some external list or data coming in ? For matching, you could look at string edit distances or cosine similarity if you are only comparing title strings. On Oct 20, 2015 9:09 PM, "Ascot Moss" wrote: > Hi, > > I have my
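A toy sketch of comparing titles against an incoming query with Jaccard similarity on word sets (one simple alternative to edit distance or cosine similarity; the titles, query and threshold are just examples):

    val titles = sc.parallelize(Seq(
      "About Spark Streaming", "About Spark MLlib", "Kafka Streaming"))

    def tokens(s: String): Set[String] = s.toLowerCase.split("\\s+").toSet
    def jaccard(a: Set[String], b: Set[String]): Double =
      if (a.isEmpty || b.isEmpty) 0.0
      else a.intersect(b).size.toDouble / a.union(b).size

    val query = "About Spark"
    titles.map(t => (t, jaccard(tokens(t), tokens(query))))
      .filter(_._2 > 0.3)   // arbitrary cut-off
      .sortBy(-_._2)
      .collect()
      .foreach(println)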

Re: Spark 1.5 Streaming and Kinesis

2015-10-20 Thread Jean-Baptiste Onofré
Hi Phil, as you can see in the Jira, I tried with both Spark 1.5.1 and 1.6.0-SNAPSHOT, and I'm not able to reproduce your issue. What's your Java version (and I guess you use scala 2.10 for compilation, not -Pscala-2.11) ? Regards JB On 10/20/2015 03:59 PM, Jean-Baptiste Onofré wrote: Hi

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
Hi Ted, I made a mistake last time; yes the connections are very controlled when I used put like you suggested, iterated over the rdd foreach and within that foreachPartition made a connection and executed the put list for hbase. But why was it that the connections were getting too many when I used hibconf and

Re: Hive custom transform scripts in Spark?

2015-10-20 Thread Michael Armbrust
We support TRANSFORM. Are you having a problem using it? On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack wrote: > How to reuse hive custom transform scripts written in python or c++? > > These scripts process data from stdin and print to stdout in spark. > They use the

Hive custom transform scripts in Spark?

2015-10-20 Thread wuyangjack
How to reuse hive custom transform scripts written in python or c++? These scripts process data from stdin and print to stdout in spark. They use the Transform Syntax in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform Example in Hive: SELECT TRANSFORM(stuff)
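For reference, a minimal sketch of the same syntax through a HiveContext; the script name, table and columns are hypothetical, and the script is expected to read tab-separated rows on stdin and write tab-separated rows on stdout, as in Hive:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("ADD FILE /path/to/my_script.py")
    val transformed = hiveContext.sql(
      "SELECT TRANSFORM (key, value) USING 'python my_script.py' AS (k, v) FROM src")
    transformed.show()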

Spark: How to find similar text title

2015-10-20 Thread Ascot Moss
Hi, I have my RDD that stores the titles of some articles: 1. "About Spark Streaming" 2. "About Spark MLlib" 3. "About Spark SQL" 4. "About Spark Installation" 5. "Kafka Streaming" 6. "Kafka Setup" 7. I need to build a model to find titles by similarity, e.g. if given "About Spark", I hope to

Re: Can not subscript to mailing list

2015-10-20 Thread Richard Hillegas
Hi Jeff, Hard to say what's going on. I have had problems subscribing to the Apache lists in the past. My problems, which may be different than yours, were caused by replying to the confirmation request from a different email account than the account I was trying to subscribe from. It was easy

Re: How does shuffle work in spark ?

2015-10-20 Thread Kartik Mathur
That will depend on what is your transformation , your code snippet might help . On Tue, Oct 20, 2015 at 1:53 AM, shahid ashraf wrote: > Hi > > Any idea why is 50 GB shuffle read and write for 3.3 gb data > > On Mon, Oct 19, 2015 at 11:58 PM, Kartik Mathur

Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Michael Armbrust
Thats not really intended to be a public API as there is some internal setup that needs to be done for Hive to work. Have you created a HiveContext in the same thread? Is there more to that stacktrace? On Tue, Oct 20, 2015 at 2:25 AM, Ayoub wrote: > Hello, > >

Re: Issue in spark batches

2015-10-20 Thread varun sharma
Hi TD, Yes, saveToCassandra throws an exception. How do I fail that task explicitly if I catch any exceptions? Right now that batch doesn't fail and remains in a hung state. Is there any way to fail that batch so that it can be tried again? Thanks Varun On Tue, Oct 20, 2015 at 2:50 AM, Tathagata Das

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Deenar Toraskar
also check out wholeTextFiles https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int) On 20 October 2015 at 15:04, Lan Jiang wrote: > As Francois pointed out, you are encountering a classic small file >

Partition for each executor

2015-10-20 Thread t3l
If I have a cluster with 7 nodes, each having an equal amount of cores, and create an RDD with sc.parallelize(), it looks as if Spark will always try to distribute the partitions. Question: (1) Is that something I can rely on? (2) Can I rely on sc.parallelize() assigning partitions to as

Re: Can not subscript to mailing list

2015-10-20 Thread Jean-Baptiste Onofré
Hi Jeff, did you try to send an e-mail to subscribe-u...@spark.apache.org ? Then you will receive a confirmation e-mail, you just have to reply (nothing special to put in the subject, the important thing is the reply-to unique address). Regards JB On 10/20/2015 05:48 PM,

Re: Top 10 count

2015-10-20 Thread Sean Owen
I believe it will be most efficient to let top(n) do the work, rather than sort the whole RDD and then take the first n. The reason is that top and takeOrdered know they need at most n elements from each partition, and then just need to merge those. It's never required to sort the whole thing. I
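A short sketch of the two variants, re-using the logs/endpoint RDD from this thread; top only merges up to n elements per partition, while the sort-based form orders the whole RDD first:

    val counts = logs.filter(_.responseCode != 200)
      .map(log => (log.endpoint, 1))
      .reduceByKey(_ + _)

    val viaTop  = counts.top(10)(Ordering.by(_._2))               // merge-based, no full sort
    val viaSort = counts.sortBy(_._2, ascending = false).take(10) // sorts the whole RDD first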

Can not subscript to mailing list

2015-10-20 Thread jeff.sadow...@gmail.com
I am having issues subscribing to the user@spark.apache.org mailing list. I would like to be added to the mailing list so I can post some configuration questions I have to the list that I do not see asked on the list. When I tried adding myself I got an email titled "confirm subscribe to

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
Request to share if you come across any hint -Original Message- From: "Ted Yu" Sent: 20-10-2015 21:30 To: "Amit Hora" Cc: "user" Subject: Re: Spark opening to many connection with zookeeper I need to dig deeper into

mailing list subscription

2015-10-20 Thread Jeff Sadowski
Is it supposed to be kind of hidden how to subscribe to the mailing list? When I go to http://apache-spark-user-list.1001560.n3.nabble.com/ (the link that comes up on a google search for the spark mailing list) I was able to register on nabble.com with no issues. Then when I tried posting it

Re: JdbcRDD Constructor

2015-10-20 Thread Deenar Toraskar
You have 2 options a) don't use partitioning, if the table is small spark will only use one task to load it val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" -> "schema.tablename")).load() b) create a view that includes hashcode column
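One way to use such a hash column from the Spark side (a sketch, not necessarily the exact approach proposed above; the URL, table and mod expression are hypothetical and database-specific) is to pass explicit, non-overlapping predicates, one per partition:

    import java.util.Properties

    val props = new Properties()
    val predicates = (0 until 4).map(i => s"mod(abs(hash_col), 4) = $i").toArray
    val jdbcDF = sqlContext.read.jdbc(
      "jdbc:postgresql:dbserver", "schema.tablename", predicates, props)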

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
> > First, this is not documented in the official document. Maybe we should do > it? http://spark.apache.org/docs/latest/sql-programming-guide.html > Pull requests welcome. > Second, nullability is a significant concept in the database people. It is > part of schema. Extra codes are needed for

Re: Problem building Spark

2015-10-20 Thread Ted Yu
On my Mac: $ ls -l ~/.m2/repository/org/antlr/antlr/3.2/antlr-3.2.jar -rw-r--r-- 1 tyu staff 895124 Dec 17 2013 /Users/tyu/.m2/repository/org/antlr/antlr/3.2/antlr-3.2.jar Looks like there might be network issue on your computer. Can you check ? Thanks On Tue, Oct 20, 2015 at 1:21 PM,

Whether Spark is appropriate for our use case.

2015-10-20 Thread Aliaksei Tsyvunchyk
Hello all community members, I need the opinion of people who have used Spark before and can share their experience, to help me select a technical approach. I have a project in the Proof Of Concept phase, where we are evaluating the possibility of Spark usage for our use case. Here is a brief task description.

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Sure. Will try to do a pull request this week. Schema evolution is always painful for database people. IMO, NULL is a bad design in the original System R. It introduces a lot of problems during system migration and data integration. Let me find a possible scenario: an RDBMS is used as an ODS.

Re: Issue in spark batches

2015-10-20 Thread Tathagata Das
That is actually a bug in the UI that got fixed in 1.5.1. The batch is actually completing with exception, the UI does not update correctly. On Tue, Oct 20, 2015 at 8:38 AM, varun sharma wrote: > Also, As you can see the timestamps in attached image. batches coming >

Re: Partition for each executor

2015-10-20 Thread Adrian Tanase
I think it should use the default parallelism which by default is equal to the number of cores in your cluster. If you want to control it, specify a value for numSlices - the second param to parallelize(). -adrian On 10/20/15, 6:13 PM, "t3l" wrote: >If I have a
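A one-line sketch of making that explicit; numSlices controls how many partitions parallelize creates:

    val rdd = sc.parallelize(1 to 100000, numSlices = sc.defaultParallelism)
    println(rdd.partitions.length)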

unsubscribe

2015-10-20 Thread Pete Zybrick
> On Oct 20, 2015, at 5:31 PM, ravi.gawai wrote: > > you can use mapfunction.. > > This is java example.. > > final JavaRDD rdd1 = sc.textFile("filepath").map((line) -> { > //logic for line to product converstion}); > > Product class might have 5 attributes like you

Get statistic result from RDD

2015-10-20 Thread Shepherd
Hi all, I am really a newbie in Spark and Scala. I cannot get the statistic result from an RDD. Could someone help me with this? The current code is as follows: /import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ val sqlContext = new

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Richard Hillegas
As an academic aside, note that all datatypes are nullable according to the SQL Standard. NOT NULL is modelled in the Standard as a constraint on data values, not as a parallel universe of special data types. However, very few databases implement NOT NULL via integrity constraints. Instead,

Re: Filter RDD

2015-10-20 Thread ravi.gawai
you can use mapfunction.. This is java example.. final JavaRDD rdd1 = sc.textFile("filepath").map((line) -> { //logic for line to product converstion}); Product class might have 5 attributes like you said class Product{ String str1; int i1; String str2; int i2; String str3; // with getter

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Lan Jiang
I think the data file is binary per the original post. So in this case, sc.binaryFiles should be used. However, I still recommend against using so many small binary files as 1. They are not good for batch I/O 2. They put too many memory pressure on namenode. Lan > On Oct 20, 2015, at 11:20

RE: Get statistic result from RDD

2015-10-20 Thread ChengBo
Thanks, but I still don’t get it. I have used groupBy to group data by userID, and for each ID, I need to get the statistic information. Best Frank From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, October 20, 2015 3:12 PM To: ChengBo Cc: user Subject: Re: Get statistic result from RDD

Re: hive thriftserver and fair scheduling

2015-10-20 Thread Sadhan Sood
Thanks Michael, I'll try it out. Another quick/important question: How do I make udfs available to all of the hive thriftserver users? Right now, when I launch a spark-sql client, I notice that it reads the ~/.hiverc file and all udfs get picked up but this doesn't seem to be working in hive

SF Spark Office Hours Experiment - Friday Afternoon

2015-10-20 Thread Holden Karau
Hi SF based folks, I'm going to try doing some simple office hours this Friday afternoon outside of Paramo Coffee. If no one comes by I'll just be drinking coffee hacking on some Spark PRs so if you just want to hangout and hack on Spark as a group come by too. (See

Re: unsubscribe

2015-10-20 Thread Ted Yu
Pete: Please don't mix unrelated email on the back of another thread. To unsubscribe, see first section of https://spark.apache.org/community On Tue, Oct 20, 2015 at 2:42 PM, Pete Zybrick wrote: > > >

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Jon Gregg
1.3 on cdh 5.4.4 ... I'll take the responses to mean that the fix will be probably a few months away for us. Not a huge problem but something I've run into a number of times. On Tue, Oct 20, 2015 at 3:01 PM, Yin Huai wrote: > btw, what version of Spark did you use? > > On

Re: Get statistic result from RDD

2015-10-20 Thread Ted Yu
Please take a look at: examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala Cheers On Tue, Oct 20, 2015 at 3:18 PM, ChengBo wrote: > Thanks, but I still don’t get it. > > I have used groupBy to group data by userID, and for each ID, I need to >

RE: Get statistic result from RDD

2015-10-20 Thread ChengBo
I tried, but it shows: “error: value reduceByKey is not a member of iterable[((Int, Int, String, String), String), Int]” Best Frank From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, October 20, 2015 3:46 PM To: ChengBo Cc: user Subject: Re: Get statistic result from RDD Please take a

Re: Get statistic result from RDD

2015-10-20 Thread Ted Yu
Your mapValues can emit a tuple. If p(0) is between 0 and 5, first component of tuple would be 1, second being 0. If p(0) is 6 or 7, first component of tuple would be 0, second being 1. You can use reduceByKey to sum up corresponding component. On Tue, Oct 20, 2015 at 1:33 PM, Shepherd
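A small sketch of that suggestion, assuming a hypothetical pair RDD byUser keyed by userID whose values expose p(0) as an Int:

    val counts = byUser.mapValues { p =>
      if (p(0) >= 0 && p(0) <= 5) (1, 0)       // falls in the 0..5 bucket
      else if (p(0) == 6 || p(0) == 7) (0, 1)  // falls in the 6..7 bucket
      else (0, 0)
    }.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // per-key bucket totals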

Top 10 count

2015-10-20 Thread Carol McDonald
To find the top 10 counts , which is better using top(10) with Ordering on the value, or swapping the key value and ordering on the key ? For example which is better below ? Or does it matter val top10 = logs.filter(log => log.responseCode != 200).map(log => (log.endpoint, 1)).reduceByKey(_ +

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
I used that also but the number of connections kept increasing: it started from 10 and went up to 299. Then I changed my zookeeper conf to set max client connections to just 30 and restarted the job. Now the connections have been between 18 and 24 for the last 2 hours. I am unable to understand such behaviour

RE: Get statistic result from RDD

2015-10-20 Thread ChengBo
I tried, but it shows: “error: value reduceByKey is not a member of iterable[((Int, Int, String, String), String), Int]” Best Cheng From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, October 20, 2015 3:46 PM To: ChengBo Cc: user Subject: Re: Get statistic result from RDD Please take a

spark-shell (1.5.1) not starting cleanly on Windows.

2015-10-20 Thread Renato Perini
I'm using Spark on Windows 10 (64 bit) and jdk8u60. When I start spark-shell, I get many warnings and exceptions. The system complains about already registered datanucleus plugins, emits exceptions, and so on: C:\test>spark-shell log4j:WARN No appenders could be found for logger

Job splling to disk and memory in Spark Streaming

2015-10-20 Thread swetha
Hi, Currently I have a job that has spills to disk and memory due to usage of reduceByKey and a lot of intermediate data in reduceByKey that gets shuffled. How to use custom partitioner in Spark Streaming for an intermediate stage so that the next stage that uses reduceByKey does not have to
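A minimal sketch of handing an explicit partitioner to the streaming reduce, assuming lines is an existing DStream[String]; the partition count, key extraction and merge logic are placeholder examples:

    import org.apache.spark.HashPartitioner

    def extractKey(line: String): String = line.split(",")(0)  // hypothetical key extraction

    val partitioner = new HashPartitioner(200)                 // 200 partitions is an example
    val keyed   = lines.map(line => (extractKey(line), line))
    val reduced = keyed.reduceByKey((a, b) => a + b, partitioner)  // placeholder merge function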

How to distinguish columns when joining DataFrames with shared parent?

2015-10-20 Thread Isabelle Phan
Hello, When joining 2 DataFrames which originate from the same initial DataFrame, why can't org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read? Let me illustrate this question with a simple example (ran on Spark 1.5.1): //my initial DataFrame scala> df
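One common workaround (a sketch; df and the column names are placeholders following the question above) is to alias each side of the self-join and qualify columns through the aliases:

    import sqlContext.implicits._

    val left   = df.as("l")
    val right  = df.as("r")
    val joined = left.join(right, $"l.id" === $"r.id")
    joined.select($"l.id", $"r.value").show()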

Re: Location preferences in pyspark?

2015-10-20 Thread Jeff Zhang
Yes, I don't think there is. You can use SparkContext.parallelize() to make a RDD from a list. But no location preferences supported yet. On Sat, Oct 17, 2015 at 8:42 AM, Philip Weaver wrote: > I believe what I want is the exact functionality provided by >

RE: Spark opening to many connection with zookeeper

2015-10-20 Thread Amit Hora
Hi All, I am using Hbase 1.1.1. I came across a post describing hbase-spark included in hbase core. I am trying to use HBaseContext but can't find the appropriate lib; while trying to add the following in the pom I am getting a missing artifact error Org.apache.hbase Hbase 1.1.1 -Original Message-

Re: Location preferences in pyspark?

2015-10-20 Thread Philip Weaver
Thanks, we decided to try to add the support ourselves :). On Tue, Oct 20, 2015 at 6:40 PM, Jeff Zhang wrote: > Yes, I don't think there is. You can use SparkContext.parallelize() to > make a RDD from a list. But no location preferences supported yet. > > On Sat, Oct 17, 2015

Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Ajay Chander
Hi Everyone, Does anyone have any idea if spark-1.5.1 is available as a service on HortonWorks? I have spark-1.3.1 installed on the Cluster and it is a HortonWorks distribution. Now I want to upgrade it to spark-1.5.1. Does anyone here have any idea about it? Thank you in advance. Regards, Ajay

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Jeff Zhang
BTW, I think Json Parser should verify the json format at least when inferring the schema of json. On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang wrote: > I think this is due to the json file format. DataFrame can only accept > json file with one valid record per line.

Re: Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Frans Thamura
Doug, is it possible to put it in HDP 2.3, especially in the Sandbox? Can you share how you install it? F -- Frans Thamura (曽志胜) Java Champion Shadow Master and Lead Investor Meruvian. Integrated Hypermedia Java Solution Provider. Mobile: +628557888699 Blog: http://blogs.mervpolis.com/roller/flatburger (id)

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-20 Thread swetha kasireddy
Hi Cody, What other options do I have other than monitoring and restarting the job? Can the job recover automatically? Thanks, Sweth On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger wrote: > Did you check you kafka broker logs to see what was going on during that > time? > >

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Jeff Zhang
I think this is due to the json file format. DataFrame can only accept json file with one valid record per line. Multiple line per record is invalid for DataFrame. On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu wrote: > Could you create a JIRA to track this bug? > > On
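A tiny sketch of the expected layout and the read call (the path is hypothetical):

    // records.json must hold one complete JSON object per line, e.g.
    //   {"name": "a", "value": 1}
    //   {"name": "b", "value": 2}
    // A pretty-printed, multi-line document ends up as _corrupt_record instead.
    val df = sqlContext.read.json("hdfs:///data/records.json")
    df.printSchema()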

How to get Histogram of all columns in a large CSV / RDD[Array[double]] ?

2015-10-20 Thread DEVAN M.S.
Hi all, I am trying to calculate the histogram of all columns from a CSV file using Spark Scala. I found that DoubleRDDFunctions supports histogram(). So I coded the following to get the histogram of all columns: 1. Get the column count 2. Create an RDD[Double] for each column and calculate the histogram of
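A compact sketch of that approach, caching the parsed rows once and reusing them per column; the path and bucket count are examples, and a purely numeric CSV is assumed:

    val rows = sc.textFile("hdfs:///data/data.csv")
      .map(_.split(",").map(_.toDouble))
      .cache()
    val numCols = rows.first().length
    val histograms = (0 until numCols).map { i =>
      i -> rows.map(_(i)).histogram(10)   // (bucket boundaries, counts) per column
    }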

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Balaji Vijayan
You are correct, that was the issue. On Tue, Oct 20, 2015 at 10:18 PM, Jeff Zhang wrote: > BTW, I think Json Parser should verify the json format at least when > inferring the schema of json. > > On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang wrote: > >> I

Re: Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Doug Balog
I have been running 1.5.1 with Hive in secure mode on HDP 2.2.4 without any problems. Doug > On Oct 21, 2015, at 12:05 AM, Ajay Chander wrote: > > Hi Everyone, > > Any one has any idea if spark-1.5.1 is available as a service on HortonWorks > ? I have spark-1.3.1

Re: Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Saisai Shao
Hi Frans, You could download Spark 1.5.1-hadoop 2.6 pre-built tarball and copy into HDP 2.3 sandbox or master node. Then copy all the conf files from /usr/hdp/current/spark-client/ to your /conf, or you could refer to this tech preview (

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread François Pelletier
You should aggregate your files into larger chunks before doing anything else. HDFS is not fit for small files; they will bloat it and cause you a lot of performance issues. Target a partition size of a few hundred MB, then save those files back to hdfs and then delete the original ones. You can
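For plain text data, a minimal compaction sketch (partition count and paths are examples; binary data would need something like the SequenceFile packing suggested elsewhere in this thread):

    val raw = sc.textFile("hdfs:///input/small_files")
    raw.coalesce(32, shuffle = false)       // fewer, larger partitions
       .saveAsTextFile("hdfs:///input/compacted")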
