Re: Why do garbled Chinese characters appear when I use Spark textFile?

2017-04-05 Thread Yan Facai
Perhaps your file is not UTF-8; I cannot reproduce the problem here. For example:

### HADOOP:
~/Downloads ❯❯❯ hdfs dfs -cat hdfs:///test.txt
17/04/06 13:43:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1.0 862910025238798
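If the file is in a non-UTF-8 encoding such as GBK, sc.textFile (which assumes UTF-8) will mangle the characters. A minimal sketch of decoding with an explicit charset, assuming a GBK-encoded file (the path and charset are illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // read raw bytes per line, then decode with the file's actual charset
    val gbkFile = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///test.txt")
      .map { case (_, line) => new String(line.getBytes, 0, line.getLength, "GBK") }
    gbkFile.first()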

Re: Read file and represent rows as Vectors

2017-04-05 Thread Yan Facai
You can try the `mapPartitions` method; there is an example here: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#mapPartitions On Mon, Apr 3, 2017 at 8:05 PM, Old-School wrote: > I have a dataset that contains DocID, WordID and frequency (count) as
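A minimal sketch of building MLlib vectors inside mapPartitions (the comma-separated layout and dense features are assumptions, since the original DocID/WordID/count schema is truncated here):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // parse each partition's lines into vectors in one pass
    val vectors = sc.textFile("data.txt").mapPartitions { lines =>
      lines.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
    }
    vectors.first()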

How does Spark connect to the Hive metastore?

2017-04-05 Thread infaelance
Hi all, when using spark-shell, my understanding is that Spark connects to Hive through the metastore. The question I have is: how does Spark connect to the metastore? Is it JDBC? Any good links to documentation on how Spark connects to Hive or other data sources?

Spark and Hive connection

2017-04-05 Thread infa elance
Hi all, when using spark-shell, my understanding is that Spark connects to Hive through the metastore. The question I have is: does Spark connect to the metastore over JDBC? Thanks and Regards, Ajay.
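For reference, a minimal sketch of pointing a Spark 2.x session at a remote Hive metastore, which is typically reached over Thrift rather than JDBC (the host and port below are illustrative; in practice hive.metastore.uris usually comes from a hive-site.xml on the classpath):

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() makes Spark use the Hive metastore client
    val spark = SparkSession.builder()
      .appName("hive-metastore-example")
      .config("hive.metastore.uris", "thrift://metastore-host:9083")
      .enableHiveSupport()
      .getOrCreate()
    spark.sql("show databases").show()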

Consuming AWS Cloudwatch logs from Kinesis into Spark

2017-04-05 Thread Tim Smith
I am sharing this code snippet since I spent quite some time figuring it out and couldn't find any examples online. Between the Kinesis documentation, the tutorial on the AWS site, and other code snippets on the Internet, I was confused about the structure/format of the messages that Spark fetches from
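The snippet itself is truncated in this digest; a sketch of the consumer side, assuming the app/stream names, endpoint, and region below, and that CloudWatch Logs delivers each Kinesis record as gzip-compressed JSON:

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream
    import scala.io.Source
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = KinesisUtils.createStream(
      ssc, "cloudwatch-app", "cloudwatch-stream",
      "kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

    // each record arrives as a gzipped byte array; decompress to get the JSON payload
    val json = stream.map { bytes =>
      Source.fromInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes))).mkString
    }
    json.print()
    ssc.start()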

Why do garbled Chinese characters appear when I use Spark textFile?

2017-04-05 Thread Jone Zhang
var textFile = sc.textFile("xxx");
textFile.first();
res1: String = 1.0 100733314 18_?:100733314 8919173c6d49abfab02853458247e5841:129:18_?:1.0

hadoop fs -cat xxx
1.0 100733314 18_百度输入法:100733314 8919173c6d49abfab02853458247e584 1:129:18_百度输入法:1.0

Why

Why do garbled Chinese characters appear when I use Spark textFile?

2017-04-05 Thread JoneZhang
var textFile = sc.textFile("xxx");
textFile.first();
res1: String = 1.0 862910025238798 100733314 18_?:100733314 8919173c6d49abfab02853458247e5841:129:18_?:1.0

hadoop fs -cat xxx
1.0 862910025238798 100733314 18_百度输入法:100733314

Re: JSON lib works differently in spark-shell and IDEs like IntelliJ

2017-04-05 Thread Mungeol Heo
It will work with spark-submit if you put the relocation below under the maven-shade-plugin:

  <relocation>
    <pattern>net.minidev</pattern>
    <shadedPattern>shaded.net.minidev</shadedPattern>
  </relocation>

I still need a way to make it work with spark-shell for testing purposes. Any idea will be great. Thank you. On Wed,
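One option to try with spark-shell, assuming the conflict is between your json-smart jar and an older copy on Spark's own classpath (spark.driver.userClassPathFirst is marked experimental, so treat this as a sketch):

    spark-shell --jars json-smart-2.3.jar --conf spark.driver.userClassPathFirst=true

This asks the driver to prefer the user-supplied jar, instead of relocating the package at build time.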

run-time exception trying to train MultilayerPerceptronClassifier with DataFrame

2017-04-05 Thread Pete Prokopowicz
Hello, I am trying to train a neural net using a DataFrame constructed from an RDD of LabeledPoints. The DataFrame's schema is: [label: double, features: vector]. The actual feature values are SparseVectors. The runtime error I get when I call val labeledPoints: RDD[LabeledPoint] =
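The message is cut off here, so the exact exception is unknown; a common cause in Spark 2.x is that MultilayerPerceptronClassifier, an ml estimator, rejects the old mllib vector type carried by RDD[LabeledPoint]. A sketch of the usual conversion, assuming that diagnosis:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    // hypothetical input: an RDD of mllib LabeledPoints with SparseVector features
    def toMlFeatures(labeledPoints: RDD[LabeledPoint]) = {
      val df = spark.createDataFrame(labeledPoints).toDF("label", "features")
      // rewrite the column from mllib.linalg.Vector to ml.linalg.Vector
      MLUtils.convertVectorColumnsToML(df, "features")
    }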

Master-Worker communication on Standalone cluster issues

2017-04-05 Thread map reduced
Hi, I was wondering how often a Worker pings the Master to check on the Master's liveness? Or is it the Master (resource manager) that pings the Workers to check on their liveness and respawns any that are dead? Or is it both? Some info: standalone cluster, 1 Master (8 cores, 12 GB), 32 workers -
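For reference, in standalone mode it is the Worker that sends heartbeats to the Master, and the Master drops a Worker it has not heard from within spark.worker.timeout (60 seconds by default, with heartbeats sent at a quarter of that interval). A sketch of raising it, assuming the default proves too aggressive (the value is illustrative):

    # in conf/spark-env.sh on the master and workers
    export SPARK_MASTER_OPTS="-Dspark.worker.timeout=120"
    export SPARK_WORKER_OPTS="-Dspark.worker.timeout=120"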

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Ah, that's why all the stuff about scheduler pools is under the section "Scheduling Within an Application". I am so used to talking to my coworkers about jobs in the sense of applications that I forgot your

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
`spark-submit` creates a new Application that will need to get resources from YARN. Spark's scheduler pools will determine how those resources are allocated among whatever Jobs run within the new Application. Spark's scheduler pools are only relevant when you are submitting multiple Jobs within a
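A minimal sketch of using pools within one application, assuming fair scheduling has been enabled (the pool name is illustrative):

    // enable the fair scheduler for the application, e.g. via
    //   --conf spark.scheduler.mode=FAIR
    // then route Jobs submitted from this thread into a named pool
    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.parallelize(1 to 1000).count()                  // this Job runs in the "production" pool
    sc.setLocalProperty("spark.scheduler.pool", null)  // back to the default pool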

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Hmm, so when I submit an application with `spark-submit`, I need to guarantee it resources using YARN queues and not Spark's scheduler pools. Is that correct? When are Spark's scheduler pools relevant/useful in this context? On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra

Re: convert JavaRDD<List> to JavaRDD

2017-04-05 Thread Vinay Parekar
Sorry, I guess my Mac autocorrected it. Yeah, it's flatMap(). From: Jiang Jacky, Date: Wednesday, April 5, 2017 at 12:10 PM, To: ostkadm, Cc: Hamza HACHANI

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
grrr... s/your/you're/ On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra wrote: > Your mixing up different levels of scheduling. Spark's fair scheduler > pools are about scheduling Jobs, not Applications; whereas YARN queues with > Spark are about scheduling Applications,

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Mark Hamstra
Your mixing up different levels of scheduling. Spark's fair scheduler pools are about scheduling Jobs, not Applications; whereas YARN queues with Spark are about scheduling Applications, not Jobs. On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas wrote: > I'm having

Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nick Chammas
I'm having trouble understanding the difference between Spark fair scheduler pools and YARN queues. Do they conflict? Does one

RE: How to use ManualClock with Spark streaming

2017-04-05 Thread Mendelson, Assaf
You can try taking a look at this: http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ Thanks, Assaf. From: Hemalatha A [mailto:hemalatha.amru...@googlemail.com] Sent: Wednesday, April 05, 2017 1:59 PM To: Saisai Shao; user@spark.apache.org Subject: Re: How to use
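The gist of the approach in that post, as a sketch (it leans on a Spark-internal class, so details may shift between versions): configure the streaming clock before creating the StreamingContext, then advance it from the test.

    import org.apache.spark.SparkConf

    // tell Spark Streaming to use the manual clock instead of the system clock
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("manual-clock-test")
      .set("spark.streaming.clock", "org.apache.spark.util.ManualClock")

Because org.apache.spark.util.ManualClock is private[spark], the test code that advances the clock typically lives in an org.apache.spark.* package to gain access.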

Re: convert JavaRDD<List> to JavaRDD

2017-04-05 Thread Jiang Jacky
There is no flattop(), just flatMap. > On Apr 5, 2017, at 12:24 PM, Vinay Parekar wrote: > > I think flattop() will be helpful in this case. Correct me if I am wrong. > > From: Hamza HACHANI > Date: Wednesday, April 5, 2017 at 3:43 AM > To:

Spark Streaming Kafka Job has strange behavior for certain tasks

2017-04-05 Thread Justin Miller
Greetings! I've been running various spark streaming jobs to persist data from kafka topics and one persister in particular seems to have issues. I've verified that the number of messages is the same per partition (roughly of course) and the volume of data is a fraction of the volume of other

Re: how do I force a unit test to do whole-stage codegen

2017-04-05 Thread Jacek Laskowski
Thanks, Koert, for the kind words. That part, however, is easy to fix, and I was surprised to see the old style referenced (!) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at

Re: bug with PYTHONHASHSEED

2017-04-05 Thread Paul Tremblay
I saw the bug fix. I am using the latest Spark available on AWS EMR, which I think is 2.0.1. I am at work and can't check my home config. I don't think AWS merged in this fix. Henry On Tue, Apr 4, 2017 at 4:42 PM, Jeff Zhang wrote: > > It is fixed in
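A workaround often suggested until the fix lands, as a sketch (the seed value and script name are placeholders): pin PYTHONHASHSEED on the executors so every Python worker hashes consistently.

    spark-submit --conf spark.executorEnv.PYTHONHASHSEED=321 my_job.py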

Re: convert JavaRDD<List> to JavaRDD

2017-04-05 Thread Vinay Parekar
I think flattop() will be helpful in this case. Correct me if I am wrong. From: Hamza HACHANI, Date: Wednesday, April 5, 2017 at 3:43 AM, To: "user@spark.apache.org"

Re: how do I force a unit test to do whole-stage codegen

2017-04-05 Thread Koert Kuipers
It's pretty much impossible to be fully up to date with Spark given how fast it moves! The book is a very helpful reference. On Wed, Apr 5, 2017 at 11:15 AM, Jacek Laskowski wrote: > Hi, > > I'm very sorry for not being up to date with the current style (and > "promoting" the

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-05 Thread kant kodali
Actually, I want to reset my counters every 24 hours, so shouldn't the window and slide interval be 24 hours? If so, how do I send updates to a real-time dashboard every second? Isn't the trigger interval the same as the slide interval? On Wed, Apr 5, 2017 at 7:17 AM, kant kodali
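For what it's worth, the trigger and the window are independent in Structured Streaming: the trigger controls how often results are recomputed and emitted, while the window controls what each count covers. A sketch against the Spark 2.1 API, where the socket source and arrival-time stamping stand in for the real stream:

    import org.apache.spark.sql.functions.{current_timestamp, window}
    import org.apache.spark.sql.streaming.ProcessingTime
    import spark.implicits._

    // toy source stamped with arrival time; replace with the real event stream
    val events = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()
      .withColumn("timestamp", current_timestamp())

    // 24-hour tumbling window, re-emitted every second for the dashboard
    val query = events.groupBy(window($"timestamp", "24 hours")).count()
      .writeStream
      .outputMode("complete")
      .trigger(ProcessingTime("1 second"))
      .format("console")   // a dashboard sink would go here
      .start()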

Re: how do I force a unit test to do whole-stage codegen

2017-04-05 Thread Jacek Laskowski
Hi, I'm very sorry for not being up to date with the current style (and "promoting" the old style), and I am going to review that part soon. I'm very close to touching it again, since I'm working with the Optimizer these days. Jacek On 5 Apr 2017 6:08 a.m., "Kazuaki Ishizaki" wrote: > Hi, >

Re: With Twitter4j API, why am I not able to pull tweets with certain keywords?

2017-04-05 Thread Ian.Maloney
I think the twitter4j API only pulls some publicly available data. To get the full dataset you might need to use a vendor like Radian6 or Gnip. See below: https://brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/ On 4/5/17, 12:02 AM,

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-05 Thread kant kodali
One of our requirements is that we need to maintain a counter over a 24-hour period, such as the number of transactions processed in the past 24 hours. After each day these counters can start from zero again, so we just need to maintain a running count during the 24-hour period. Also, since we want to show

Re: unit testing in spark

2017-04-05 Thread Shiva Ramagopal
Hi, I've been following this thread for a while. I'm trying to introduce a test strategy in my team to test a number of data pipelines before production. I have watched Lars' presentation and find it great. However, I'm debating whether unit tests are worth the effort if there are good job-level

Re: How to use ManualClock with Spark streaming

2017-04-05 Thread Hemalatha A
Any updates on how I can use ManualClock other than by editing the Spark source code? On Wed, Mar 1, 2017 at 10:19 AM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > It is certainly possible through a hack. > I was referring to the below post where TD says it is possible through a hack. I >

JSON lib works differently in spark-shell and IDEs like IntelliJ

2017-04-05 Thread Mungeol Heo
Hello, I am using "minidev", a JSON lib, to remove duplicated keys in a JSON object.

Dependency:

  <dependency>
    <groupId>net.minidev</groupId>
    <artifactId>json-smart</artifactId>
    <version>2.3</version>
  </dependency>

Test code:

  import net.minidev.json.parser.JSONParser
  val badJson =
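Since the test code is cut off in this digest, here is a minimal sketch of the parse call, assuming json-smart's map-backed objects, where a later duplicate key overwrites an earlier one:

    import net.minidev.json.parser.JSONParser

    val parser = new JSONParser(JSONParser.MODE_PERMISSIVE)
    // duplicated key "a": the parsed object keeps a single entry
    val cleaned = parser.parse("""{"a":1,"a":2}""")
    println(cleaned)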

Re: convert JavaRDD<List> to JavaRDD

2017-04-05 Thread hosur narahari
Use the flatMap function on JavaRDD. On 5 Apr 2017 3:13 p.m., "Hamza HACHANI" wrote: > I want to convert a JavaRDD<List<Object>> to a JavaRDD<Object>. For example, > if there are 3 elements in the List, 3 Objects would be created in my new > JavaRDD. > > Does anyone have an idea?

convert JavaRDD<List> to JavaRDD

2017-04-05 Thread Hamza HACHANI
I want to convert a JavaRDD<List<Object>> to a JavaRDD<Object>. For example, if there are 3 elements in the List, 3 Objects would be created in my new JavaRDD. Does anyone have an idea?

Re: Market Basket Analysis by deploying FP Growth algorithm

2017-04-05 Thread Patrick Plaatje
Hi Arun, We have been running into the same issue (having only 1000 unique items in 100MM transactions), but have not investigated the root cause. We decided to run this on a cluster instead (4*16 cores / 64 GB RAM), after which the OOM issue went away. However, we ran into the issue that

Market Basket Analysis by deploying FP Growth algorithm

2017-04-05 Thread asethia
Hi, We are currently working on a Market Basket Analysis, deploying the FP-Growth algorithm on Spark to generate association rules for product recommendation. We are running on close to 24 million invoices over an assortment of more than 100k products. However, whenever we relax the support
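For context, a minimal sketch of the MLlib FP-Growth API being discussed (the file name and thresholds are illustrative; lowering minSupport is what typically blows up memory, since the number of frequent itemsets grows combinatorially):

    import org.apache.spark.mllib.fpm.FPGrowth

    // one basket per line, comma-separated product ids; items must be unique per basket
    val transactions = sc.textFile("invoices.txt").map(_.split(",").distinct)

    val model = new FPGrowth()
      .setMinSupport(0.01)
      .setNumPartitions(64)   // spreads the FP-tree work across the cluster
      .run(transactions)

    model.generateAssociationRules(0.3).collect().foreach(println)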

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-05 Thread kant kodali
Hi! I am talking about "stateful operations like aggregations". Does this happen on heap or off heap by default? I came across an article where I saw that both on and off heap are possible, but I am not sure what happens by default and when Spark or Spark Structured Streaming decides to store off heap.

reading binary file in spark-kafka streaming

2017-04-05 Thread Yogesh Vyas
Hi, I have a binary file which I read in a Kafka producer and send to a message queue. I then read this in the Spark-Kafka consumer as a streaming job, but it gives me the following error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 112: invalid start byte. Can anyone
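The error suggests the consumer is decoding the raw bytes as UTF-8 strings (the traceback is Python's, and PySpark's Kafka stream decodes values as UTF-8 by default). The fix is to consume bytes instead; a sketch of the same idea with the Scala Kafka 0.8 direct stream, where the broker and topic names are placeholders:

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val topics = Set("binary-topic")

    // DefaultDecoder hands back Array[Byte] untouched, so no charset is assumed
    val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, topics)
    stream.map { case (_, bytes) => bytes.length }.print()   // handle the binary payload here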