Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Chris Miller
flatMap? -- Chris Miller On Thu, Apr 7, 2016 at 10:25 PM, greg huang <debin.hu...@gmail.com> wrote: > Hi All, > > Can someone give me an example code to get rid of the empty string in > JavaRDD? I know there is a filter method in JavaRDD: > https://spark.apache.org/do
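
A minimal sketch of the two approaches mentioned in this thread, written in plain Python rather than against the Spark API so that it runs without a cluster. The helper names `drop_empty_with_filter` and `drop_empty_with_flat_map` are hypothetical; in Spark the corresponding calls would be `filter` and `flatMap` on the RDD.

```python
from itertools import chain


def drop_empty_with_filter(lines):
    """Keep only the non-empty strings (the `filter` approach)."""
    return [s for s in lines if s != ""]


def drop_empty_with_flat_map(lines):
    """Map each string to zero or one outputs, then flatten (the `flatMap` approach)."""
    # An empty string maps to an empty list, so it vanishes when flattened.
    return list(chain.from_iterable([] if s == "" else [s] for s in lines))


lines = ["a", "", "b", "", "c"]
print(drop_empty_with_filter(lines))
print(drop_empty_with_flat_map(lines))
```

Both helpers return the same result here; `filter` is usually the more direct choice when each input yields at most one output.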

Re: Spark schema evolution

2016-03-22 Thread Chris Miller
With Avro you solve this by using a default value for the new field... maybe Parquet is the same? -- Chris Miller On Tue, Mar 22, 2016 at 9:34 PM, gtinside <gtins...@gmail.com> wrote: > Hi, > > I have a table sourced from *2 parquet files* with a few extra columns in one > o

Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
If you have lots of small files, distcp should handle that well -- it's supposed to distribute the transfer of files across the nodes in your cluster. Conductor looks interesting if you're trying to distribute the transfer of single, large file(s)... right? -- Chris Miller On Wed, Mar 16, 2016

Re: Does parallelize and collect preserve the original order of list?

2016-03-16 Thread Chris Miller
Short answer: Nope. Less short answer: Spark is not designed to maintain sort order in this case. It *may*, but there's no guarantee. Generally, the result would not be in the same order unless you implement something to order by and then sort the result based on that. -- Chris Miller On Wed, Mar 16
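
The suggestion to "implement something to order by and then sort the result" can be sketched as follows. This is plain Python, not Spark code: `with_index` and `restore_order` are hypothetical helpers, and the shuffle only stands in for processing that does not keep order. In Spark, the analogous calls would be `zipWithIndex` followed by `sortBy` on the index.

```python
import random


def with_index(items):
    """Pair each element with its original position (similar in spirit to zipWithIndex)."""
    return [(value, index) for index, value in enumerate(items)]


def restore_order(indexed_items):
    """Sort by the saved position and drop the index again."""
    return [value for value, _ in sorted(indexed_items, key=lambda pair: pair[1])]


original = ["d", "a", "c", "b"]
indexed = with_index(original)
random.shuffle(indexed)  # stand-in for processing that does not keep order
restored = restore_order(indexed)
print(restored == original)
```

Carrying an explicit index makes the ordering part of the data, so it no longer depends on how the results happen to be gathered.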

Re: reading file from S3

2016-03-16 Thread Chris Miller
described. -- Chris Miller On Tue, Mar 15, 2016 at 11:22 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > There are many solutions to a problem. > > Also understand that sometimes your situation might be such. For ex what > if you are accessing S3 from your

Re: Correct way to use spark streaming with apache zeppelin

2016-03-13 Thread Chris Miller
Cool! Thanks for sharing. -- Chris Miller On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist <tsind...@gmail.com> wrote: > Below is a link to an example which Silvio Fiorito put together > demonstrating how to link Zeppelin with Spark Stream for real-time charts. > I think the

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
n. > > I'm just wondering, what's the best way to store Stats table (a database or > parquet file?) > What exactly are you trying to do? Zeppelin is for interactive analysis of > a dataset. What do you mean "realtime analytics" -- do you mean build a > report or dashboard t

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
lly, if I add rdd.persist(), then it doesn't work. I guess I would need to do .map(_._1.datum) again before the map that does the real work. -- Chris Miller On Sat, Mar 12, 2016 at 4:15 PM, Chris Miller <cmiller11...@gmail.com> wrote: > Wow! That sure is buried in the documentation! But yeah, t

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
;myValue")) }) * What am I doing wrong? -- Chris Miller On Sat, Mar 12, 2016 at 1:48 PM, Peyman Mohajerian <mohaj...@gmail.com> wrote: > Here is the reason for the behavior: > '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable > object for each record, direc
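
The object re-use described in the quoted note can be illustrated without Hadoop or Spark. In this sketch, `Record` and `read_records` are hypothetical stand-ins for a Writable and a record reader that mutates one shared object; the point is only that keeping references to a re-used object makes every stored entry look like the last record read, and that copying each record first avoids this.

```python
import copy


class Record:
    """Hypothetical mutable record, standing in for a Hadoop Writable."""

    def __init__(self):
        self.value = None


def read_records(values):
    """Yield the SAME Record instance for every value, mutating it in place."""
    shared = Record()
    for v in values:
        shared.value = v
        yield shared


# Holding on to the yielded objects keeps many references to one object,
# so every entry shows the last value that was read.
buggy = [r for r in read_records([1, 2, 3])]
print([r.value for r in buggy])  # prints [3, 3, 3]

# Copying each record as it is read gives every entry its own object.
fixed = [copy.copy(r) for r in read_records([1, 2, 3])]
print([r.value for r in fixed])  # prints [1, 2, 3]
```

The same idea applies to the thread's case: copy or extract the needed fields from each record in a map step before caching, sorting, or aggregating.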

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
What exactly are you trying to do? Zeppelin is for interactive analysis of a dataset. What do you mean "realtime analytics" -- do you mean build a report or dashboard that automatically updates as new data comes in? -- Chris Miller On Sat, Mar 12, 2016 at 3:13 PM, trung kien <kient

Repeating Records w/ Spark + Avro?

2016-03-11 Thread Chris Miller
he datum? Seems I'm not the only one who ran into this problem: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't figure out how to fix it in my case without hacking away like the person in the linked PR did. Suggestions? -- Chris Miller

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-06 Thread Chris Miller
For anyone running into this same issue, it looks like Avro deserialization is just broken when used with SparkSQL and partitioned schemas. I created a bug report with details and a simplified example of how to reproduce it: https://issues.apache.org/jira/browse/SPARK-13709 -- Chris Miller On Fri

Re: Is Spark right for us?

2016-03-06 Thread Chris Miller
Gut instinct is no, Spark is overkill for your needs... you should be able to accomplish all of that with a relational database or a column-oriented database (depending on the types of queries you most frequently run and the performance requirements). -- Chris Miller On Mon, Mar 7, 2016 at 1:17

Re: MLLib + Streaming

2016-03-06 Thread Chris Miller
Guru: This is a really great response. Thanks for taking the time to explain all of this. Helpful for me too. -- Chris Miller On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani <gdm...@gmail.com> wrote: > Hi Lan, > > Streaming Means, Linear Regression and Logistic Regression support o

Re: Best way to merge files from streaming jobs on S3

2016-03-04 Thread Chris Miller
of writing to the file from coalesce, sort that data structure, then write your file. -- Chris Miller On Sat, Mar 5, 2016 at 5:24 AM, jelez <je...@hotmail.com> wrote: > My streaming job is creating files on S3. > The problem is that those files end up very small if I just write them to >

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
work fine with Hive, and I imagine the same deserializer code is used there too. Thoughts? -- Chris Miller On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman <igor.ber...@gmail.com> wrote: > your field name is > *enum1_values* > > but you have data > { "foo1": "te

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
eTool.java:99) at org.apache.avro.tool.Main.run(Main.java:84) at org.apache.avro.tool.Main.main(Main.java:73) Any other ideas? -- Chris Miller On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman <igor.ber...@gmail.com> wrote: > your field name is > *enum1_valu

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
r is the same. I also tried querying from Scala instead of using Zeppelin, and I get the same error. Where should I begin with troubleshooting this problem? This same query runs fine on Hive. Based on the error, it appears to be something in the deserializer though... but if it were a bu

Avro SerDe Issue w/ Manual Partitions?

2016-03-02 Thread Chris Miller
my schema. This same table and query structure works fine with Hive. When I try to run this with SparkSQL, however, I get the above error. Anyone have any idea what the problem is here? Thanks! -- Chris Miller