New to Spark 2.2.1 - Problem with finding tables between different metastore DBs

2018-02-06 Thread Subhajit Purkayastha
All,

 

I am new to Spark 2.2.1.  I have a single-node cluster and have also enabled
the Thrift Server so that my Tableau application can connect to my persisted
table.

I suspect that the Spark cluster's metastore is different from the Thrift
Server's metastore. If this assumption is valid, what do I need to do to make
them share a single, universal metastore?
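
For context, a spark-sql or spark-shell session that has no hive-site.xml usually creates its
own local Derby metastore_db in its working directory, which would explain the two catalogs
looking different. A minimal sketch of the usual shared setup, with a placeholder metastore
host and port, is to point $SPARK_HOME/conf/hive-site.xml at one external Hive metastore:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>

and then start the Thrift Server from the same installation so it picks up the same file:

$SPARK_HOME/sbin/start-thriftserver.sh --master local[*]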

 

Once I have fixed this issue, I will enable another node in my cluster

 

Thanks,

 

Subhajit 

 



Re: New to spark.

2016-09-28 Thread Bryan Cutler
Hi Anirudh,

All types of contributions are welcome, from code to documentation.  Please
check out the page at
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for
some info, specifically keep a watch out for starter JIRAs here
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)
.

On Wed, Sep 28, 2016 at 9:11 AM, Anirudh Muhnot  wrote:

> Hello everyone, I'm Anirudh. I'm fairly new to spark as I've done an
> online specialisation from UC Berkeley. I know how to code in Python but
> have little to no idea about Scala. I want to contribute to Spark, Where do
> I start and how? I'm reading the pull requests at Git Hub but I'm barley
> able to understand them. Can anyone help? Thank you.
> Sent from my iPhone
>


New to spark.

2016-09-28 Thread Anirudh Muhnot
Hello everyone, I'm Anirudh. I'm fairly new to Spark; I've done an online 
specialisation from UC Berkeley. I know how to code in Python but have little 
to no idea about Scala. I want to contribute to Spark. Where do I start, and 
how? I'm reading the pull requests on GitHub but I'm barely able to understand 
them. Can anyone help? Thank you. 
Sent from my iPhone

Re: new to Spark - trying to get a basic example to run - could use some help

2016-02-13 Thread Ted Yu
Maybe a comment should be added to SparkPi.scala, telling the user to look for
the value in the stdout log?

Cheers

On Sat, Feb 13, 2016 at 3:12 AM, Chandeep Singh 
wrote:

> Try looking at stdout logs. I ran the exactly same job as you and did not
> see anything on the console as well but found it in stdout.
>
> [csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi
>     --master yarn --deploy-mode cluster --name RT_SparkPi
>     /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10
>
> Log Type: stdout
>
> Log Upload Time: Sat Feb 13 11:00:08 + 2016
>
> Log Length: 23
>
> Pi is roughly 3.140224
>
>
> Hope that helps!
>
>
> On Sat, Feb 13, 2016 at 3:14 AM, Taylor, Ronald C 
> wrote:
>
>> Hello folks,
>>
>> This is my first msg to the list. New to Spark, and trying to run the
>> SparkPi example shown in the Cloudera documentation.  We have Cloudera
>> 5.5.1 running on a small cluster at our lab, with Spark 1.5.
>>
>> My trial invocation is given below. The output that I get **says** that
>> I “SUCCEEDED” at the end. But – I don’t get any screen output on the value
>> of pi. I also tried a SecondarySort Spark program  that I compiled and
>> jarred from Dr. Parsian’s Data Algorithms book. That program  failed. So –
>> I am focusing on getting SparkPi to work properly, to get started. Can
>> somebody look at the screen output that I cut-and-pasted below and infer
>> what I might be doing wrong?
>>
>> Am I forgetting to set one or more environment variables? Or not setting
>> such properly?
>>
>> Here is the CLASSPATH value that I set:
>>
>>
>> CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils
>>
>> Here is the settings of other environment variables:
>>
>> HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
>> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
>>
>> HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar'
>>
>> SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils'
>>
>> I am not sure that those env vars are properly set (or if even all of
>> them are needed). But that’s what I’m currently using.
>>
>> As I said, the invocation below appears to terminate with final status
>> set to “SUCCEEDED”. But – there is no screen output on the value of pi,
>> which I understood would be shown. So – something appears to be going
>> wrong. I went to the tracking URL given at the end, but could not access it.
>>
>> I would very much appreciate some guidance!
>>
>>
>>- Ron Taylor
>>
>>
>> %
>>
>> INVOCATION:
>>
>> [rtaylor@bigdatann]$ spark-submit --class org.apache.spark.examples.SparkPi
>>     --master yarn --deploy-mode cluster --name RT_SparkPi
>>     /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10
>>
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at
>> bigdatann.ib/172.17.115.18:8032
>> 16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from
>> cluster with 15 NodeManagers
>> 16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not
>> requested more than the maximum memory capability of the cluster (65536 MB
>> per container)
>> 16/02/12 

Re: new to Spark - trying to get a basic example to run - could use some help

2016-02-13 Thread Chandeep Singh
Try looking at the stdout logs. I ran exactly the same job as you and also did
not see anything on the console, but found it in stdout.

[csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi
    --master yarn --deploy-mode cluster --name RT_SparkPi
    /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10

Log Type: stdout

Log Upload Time: Sat Feb 13 11:00:08 + 2016

Log Length: 23

Pi is roughly 3.140224


Hope that helps!
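
Two related ways to get at the value, as a rough sketch; the application id below is just the
one from Ron's log, used for illustration:

# run in client mode so the value prints on the console that submitted the job
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client \
  /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10

# or, after a cluster-mode run (with YARN log aggregation enabled), pull the driver's stdout
yarn logs -applicationId application_1454115464826_0070 | grep "Pi is roughly"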


On Sat, Feb 13, 2016 at 3:14 AM, Taylor, Ronald C 
wrote:

> Hello folks,
>
> This is my first msg to the list. New to Spark, and trying to run the
> SparkPi example shown in the Cloudera documentation.  We have Cloudera
> 5.5.1 running on a small cluster at our lab, with Spark 1.5.
>
> My trial invocation is given below. The output that I get **says** that I
> “SUCCEEDED” at the end. But – I don’t get any screen output on the value of
> pi. I also tried a SecondarySort Spark program  that I compiled and jarred
> from Dr. Parsian’s Data Algorithms book. That program  failed. So – I am
> focusing on getting SparkPi to work properly, to get started. Can somebody
> look at the screen output that I cut-and-pasted below and infer what I
> might be doing wrong?
>
> Am I forgetting to set one or more environment variables? Or not setting
> such properly?
>
> Here is the CLASSPATH value that I set:
>
>
> CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils
>
> Here is the settings of other environment variables:
>
> HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
>
> HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar'
>
> SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils'
>
> I am not sure that those env vars are properly set (or if even all of them
> are needed). But that’s what I’m currently using.
>
> As I said, the invocation below appears to terminate with final status set
> to “SUCCEEDED”. But – there is no screen output on the value of pi, which I
> understood would be shown. So – something appears to be going wrong. I went
> to the tracking URL given at the end, but could not access it.
>
> I would very much appreciate some guidance!
>
>
>- Ron Taylor
>
>
> %
>
> INVOCATION:
>
> [rtaylor@bigdatann]$ spark-submit --class org.apache.spark.examples.SparkPi
>     --master yarn --deploy-mode cluster --name RT_SparkPi
>     /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at
> bigdatann.ib/172.17.115.18:8032
> 16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from
> cluster with 15 NodeManagers
> 16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not
> requested more than the maximum memory capability of the cluster (65536 MB
> per container)
> 16/02/12 18:16:59 INFO yarn.Client: Will allocate AM container, with 1408
> MB memory including 384 MB overhead
> 16/02/12 18:16:59 INFO yarn.Client: Setting up container launch context
> for our AM
> 16/02/12 18:16:59 INFO yarn.Client: Setting up the launch environment for
> our AM container
> 16/02/12 18:16:59 INFO yarn.Client: Preparing resources for our AM
> container
> 16/02/12 18:17:00 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/02/12 18:17:00 INFO yarn.Client:

new to Spark - trying to get a basic example to run - could use some help

2016-02-12 Thread Taylor, Ronald C
Hello folks,

This is my first msg to the list. New to Spark, and trying to run the SparkPi 
example shown in the Cloudera documentation.  We have Cloudera 5.5.1 running on 
a small cluster at our lab, with Spark 1.5.

My trial invocation is given below. The output that I get *says* that I 
"SUCCEEDED" at the end. But - I don't get any screen output on the value of pi. 
I also tried a SecondarySort Spark program  that I compiled and jarred from Dr. 
Parsian's Data Algorithms book. That program  failed. So - I am focusing on 
getting SparkPi to work properly, to get started. Can somebody look at the 
screen output that I cut-and-pasted below and infer what I might be doing wrong?

Am I forgetting to set one or more environment variables? Or not setting such 
properly?

Here is the CLASSPATH value that I set:

CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils

Here are the settings of the other environment variables:

HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar'
SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils'

I am not sure that those env vars are properly set (or even whether all of them are 
needed). But that's what I'm currently using.

As I said, the invocation below appears to terminate with final status set to 
"SUCCEEDED". But - there is no screen output on the value of pi, which I 
understood would be shown. So - something appears to be going wrong. I went to 
the tracking URL given at the end, but could not access it.

I would very much appreciate some guidance!

-   Ron Taylor

%

INVOCATION:

[rtaylor@bigdatann]$ spark-submit --class org.apache.spark.examples.SparkPi
    --master yarn --deploy-mode cluster --name RT_SparkPi
    /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at 
bigdatann.ib/172.17.115.18:8032
16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from cluster 
with 15 NodeManagers
16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (65536 MB per container)
16/02/12 18:16:59 INFO yarn.Client: Will allocate AM container, with 1408 MB 
memory including 384 MB overhead
16/02/12 18:16:59 INFO yarn.Client: Setting up container launch context for our 
AM
16/02/12 18:16:59 INFO yarn.Client: Setting up the launch environment for our 
AM container
16/02/12 18:16:59 INFO yarn.Client: Preparing resources for our AM container
16/02/12 18:17:00 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/02/12 18:17:00 INFO yarn.Client: Uploading resource 
file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
 -> 
hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
16/02/12 18:17:21 INFO yarn.Client: Uploading resource 
file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
 -> 
hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
16/02/12 18:17:23 INFO yarn.Client: Uploading resource 
file:/tmp/spark-141bf8a4-2f4b-49d3-b041-61070107e4de/__spark_conf__8357851336386157291.zip
 -> 
hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/__spark_conf__8357851336386157291.zip
16/02/12 18:17:23 INFO spark.SecurityManag

Re: New to Spark

2015-12-01 Thread Ted Yu
Have you tried the following command ?
REFRESH TABLE 

Cheers
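
A minimal Spark 1.x sketch of that from spark-shell; the table name my_hive_table is only an
illustration:

// HiveContext talks to the configured Hive metastore (Spark 1.x)
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.sql("REFRESH TABLE my_hive_table")                 // re-read the table's metadata and file listing
hiveCtx.sql("SELECT count(1) FROM my_hive_table").show()   // should now reflect the Hive data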



On Tue, Dec 1, 2015 at 1:54 AM, Ashok Kumar 
wrote:

> Hi,
>
> I am new to Spark.
>
> I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables.
>
> I have successfully made Hive metastore to be used by Spark.
>
> In spark-sql I can see the DDL for Hive tables. However, when I do select
> count(1) from HIVE_TABLE it always returns zero rows.
>
> If I create a table in spark as create table SPARK_TABLE as select * from
> HIVE_TABLE, the table schema is created but no data.
>
> I can then use Hive to do INSET SELECT from HIVE_TABLE into SPARK_TABLE.
> That works.
>
> I can then use spark-sql to query the table.
>
> My questions:
>
>
>1. Is this correct that spark-sql only sees data in spark created
>tables but not any data in Hive tables?
>2. How can I make Spark read data from existing Hive tables.
>
>
>
> Thanks
>


Re: New to Spark

2015-12-01 Thread fightf...@163.com
Hi there,
Which Spark version are you using? You made the Hive metastore available to
Spark; does that mean you can run SQL queries over the current Hive tables?
Or are you just using the local Hive metastore embedded on the Spark SQL side?
I think you need to provide more info on your Spark SQL and Hive config; that
would help locate the root cause of the problem.

Best,
Sun.



fightf...@163.com
 
From: Ashok Kumar
Date: 2015-12-01 18:54
To: user@spark.apache.org
Subject: New to Spark

Hi,

I am new to Spark.

I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables.

I have successfully made Hive metastore to be used by Spark.

In spark-sql I can see the DDL for Hive tables. However, when I do select 
count(1) from HIVE_TABLE it always returns zero rows.

If I create a table in spark as create table SPARK_TABLE as select * from 
HIVE_TABLE, the table schema is created but no data.

I can then use Hive to do INSET SELECT from HIVE_TABLE into SPARK_TABLE. That 
works.

I can then use spark-sql to query the table.

My questions:

Is this correct that spark-sql only sees data in spark created tables but not 
any data in Hive tables?
How can I make Spark read data from existing Hive tables.


Thanks




New to Spark

2015-12-01 Thread Ashok Kumar

  Hi,
I am new to Spark.
I am trying to use spark-sql with SPARK-CREATED and HIVE-CREATED tables.
I have successfully set up the Hive metastore to be used by Spark.
In spark-sql I can see the DDL for the Hive tables. However, when I do select 
count(1) from HIVE_TABLE it always returns zero rows.
If I create a table in Spark as create table SPARK_TABLE as select * from 
HIVE_TABLE, the table schema is created but there is no data.
I can then use Hive to do INSERT SELECT from HIVE_TABLE into SPARK_TABLE. That 
works.
I can then use spark-sql to query the table.
My questions:
   
   - Is it correct that spark-sql only sees data in Spark-created tables but 
not any data in Hive tables?
   - How can I make Spark read data from existing Hive tables?


Thanks

   

Re: New to Spark - Partitioning Question

2015-09-09 Thread Richard Marscher
Ah I see. In that case, the groupByKey function does guarantee every key is
on exactly one partition matched with the aggregated data. This can be
improved depending on what you want to do after. Group by key only
aggregates the data after shipping it across the cluster. Meanwhile, using
reduceByKey will do aggregation on each node first, then ship those results
to the final node and partition to finalize the aggregation there. If that
makes sense.

So say Node 1 has pairs: (a, 1), (b, 2), (b, 6)
Node 2 has pairs: (a, 2), (a,3), (b, 4)

group by would say send both a pair and b pairs across the network. If you
did reduce with the aggregate of sum then you'd expect it to ship (b, 8)
from Node 1 or (a, 5) from Node 2 since it did the local aggregation first.

You are correct that doing something with expensive side-effects like
writing to a database (connections and network + I/O) is best done with the
mapPartitions or foreachPartition type of functions on RDD so you can share
a database connection and also potentially do things like batch statements.
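
A rough Scala sketch of combining those pieces; the key and value types, the partition count,
and the write step are placeholders rather than the actual job:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// rows: RDD[(Int, Double)] keyed by mine id, already filtered to the scenario (hypothetical types)
def writeCumulative(rows: RDD[(Int, Double)]): Unit = {
  // groupByKey with an explicit partitioner: all values for a key land in exactly one partition
  val grouped = rows.groupByKey(new HashPartitioner(64))
  grouped.foreachPartition { iter =>
    // open one database connection per partition here (placeholder)
    iter.foreach { case (mineId, values) =>
      val cumulative = values.scanLeft(0.0)(_ + _).tail  // running total as the example cumulative calc
      // batch-write (mineId, cumulative) over the shared connection (placeholder)
    }
    // close the connection here
  }
}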


On Tue, Sep 8, 2015 at 7:37 PM, Mike Wright  wrote:

> Thanks for the response!
>
> Well, in retrospect each partition doesn't need to be restricted to a
> single key. But, I cannot have values associated with a key span partitions
> since they all need to be processed together for a key to facilitate
> cumulative calcs. So provided an individual key has all its values in a
> single partition, I'm OK.
>
> Additionally, the values will be written to the database, and from what I
> have read doing this at the partition level is the best compromise between
> 1) Writing the calculated values for each key (lots of connect/disconnects)
> and collecting them all at the end and writing them all at once.
>
> I am using a groupBy against the filtered RDD the get the grouping I want,
> but apparently this may not be the most efficient way, and it seems that
> everything is always in a single partition under this scenario.
>
>
> ___
>
> *Mike Wright*
> Principal Architect, Software Engineering
>
> SNL Financial LC
> 434-951-7816 *p*
> 434-244-4466 *f*
> 540-470-0119 *m*
>
> mwri...@snl.com
>
> On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher  > wrote:
>
>> That seems like it could work, although I don't think `partitionByKey` is
>> a thing, at least for RDD. You might be able to merge step #2 and step #3
>> into one step by using the `reduceByKey` function signature that takes in a
>> Partitioner implementation.
>>
>> def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
>> (Partitioner: <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html>,
>> RDD: <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>)
>>
>> Merge the values for each key using an associative reduce function. This
>> will also perform the merging locally on each mapper before sending results
>> to a reducer, similarly to a "combiner" in MapReduce.
>>
>> The tricky part might be getting the partitioner to know about the number
>> of partitions, which I think it needs to know upfront in `abstract def
>> numPartitions: Int`. The `HashPartitioner` for example takes in the
>> number as a constructor argument, maybe you could use that with an upper
>> bound size if you don't mind empty partitions. Otherwise you might have to
>> mess around to extract the exact number of keys if it's not readily
>> available.
>>
>> Aside: what is the requirement to have each partition only contain the
>> data related to one key?
>>
>> On Fri, Sep 4, 2015 at 11:06 AM, mmike87  wrote:
>>
>>> Hello, I am new to Apache Spark and this is my company's first Spark
>>> project.
>>> Essentially, we are calculating models dealing with Mining data using
>>> Spark.
>>>
>>> I am holding all the source data in a persisted RDD that we will refresh
>>> periodically. When a "scenario" is passed to the Spark job (we're using
>>> Job
>>> Server) the persisted RDD is filtered to the relevant mines. For
>>> example, we
>>> may want all mines in Chile and the 1990-2015 data for each.
>>>
>>> Many of the calculations are cumulative, that is when we apply user-input
>>> "adjustment factors" to a value, we also need the "flexed" value we
>>> calculated for that mine previously.
>>>
>>> To ensure that this works, the idea if to:
>>>
>>> 1) Filter the superset to relevant mines (done)
>>> 2) Group the subset by the unique identifier for the mine. So, a gro

Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o

2015-09-09 Thread Ted Yu
Prachicsa:
If the number of EC tokens is high, please consider using a set instead of an 
array for better lookup performance. 

BTW, please use a short, descriptive subject for future emails. 
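
A rough sketch of that suggestion, reusing the sample data from Akhil's reply quoted below:

val tocks = Set("EC-17A5206955089011B", "EC-17A5206955089011A")

val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B", "This doesnt"))

// exists() stops at the first matching token; a Set also de-duplicates the token list
val matches = rddAll.filter(line => tocks.exists(token => line.contains(token))).collect()
// matches: Array[String] = Array(This contains EC-17A5206955089011B)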



> On Sep 9, 2015, at 3:13 AM, Akhil Das  wrote:
> 
> Try this:
> 
> val tocks = Array("EC-17A5206955089011B","EC-17A5206955089011A")
> 
> val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B","This 
> doesnt"))
> 
> rddAll.filter(line => {
>   var found = false
>   for (item <- tocks) {
>     if (line.contains(item)) found = true
>   }
>   found
> }).collect()
> 
> 
> Output:
> res8: Array[String] = Array(This contains EC-17A5206955089011B)
> 
> Thanks
> Best Regards
> 
>> On Wed, Sep 9, 2015 at 3:25 PM, prachicsa  wrote:
>> 
>> 
>> I am very new to Spark.
>> 
>> I have a very basic question. I have an array of values:
>> 
>> listofECtokens: Array[String] = Array(EC-17A5206955089011B,
>> EC-17A5206955089011A)
>> 
>> I want to filter an RDD for all of these token values. I tried the following
>> way:
>> 
>> val ECtokens = for (token <- listofECtokens) rddAll.filter(line =>
>> line.contains(token))
>> 
>> Output:
>> 
>> ECtokens: Unit = ()
>> 
>> I got an empty Unit even when there are records with these tokens. What am I
>> doing wrong?
>> 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/I-am-very-new-to-Spark-I-have-a-very-basic-question-I-have-an-array-of-values-listofECtokens-Array-S-tp24617.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o

2015-09-09 Thread Akhil Das
Try this:

val tocks = Array("EC-17A5206955089011B","EC-17A5206955089011A")

val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B","This
doesnt"))

rddAll.filter(line => {
  var found = false
  for (item <- tocks) {
    if (line.contains(item)) found = true
  }
  found
}).collect()


Output:
res8: Array[String] = Array(This contains EC-17A5206955089011B)

Thanks
Best Regards

On Wed, Sep 9, 2015 at 3:25 PM, prachicsa  wrote:

>
>
> I am very new to Spark.
>
> I have a very basic question. I have an array of values:
>
> listofECtokens: Array[String] = Array(EC-17A5206955089011B,
> EC-17A5206955089011A)
>
> I want to filter an RDD for all of these token values. I tried the
> following
> way:
>
> val ECtokens = for (token <- listofECtokens) rddAll.filter(line =>
> line.contains(token))
>
> Output:
>
> ECtokens: Unit = ()
>
> I got an empty Unit even when there are records with these tokens. What am
> I
> doing wrong?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/I-am-very-new-to-Spark-I-have-a-very-basic-question-I-have-an-array-of-values-listofECtokens-Array-S-tp24617.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of

2015-09-09 Thread prachicsa


I am very new to Spark.

I have a very basic question. I have an array of values:

listofECtokens: Array[String] = Array(EC-17A5206955089011B,
EC-17A5206955089011A)

I want to filter an RDD for all of these token values. I tried the following
way:

val ECtokens = for (token <- listofECtokens) rddAll.filter(line =>
line.contains(token))

Output:

ECtokens: Unit = ()

I got an empty Unit even when there are records with these tokens. What am I
doing wrong?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/I-am-very-new-to-Spark-I-have-a-very-basic-question-I-have-an-array-of-values-listofECtokens-Array-S-tp24617.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: New to Spark - Partitioning Question

2015-09-08 Thread Mike Wright
Thanks for the response!

Well, in retrospect each partition doesn't need to be restricted to a
single key. But, I cannot have values associated with a key span partitions
since they all need to be processed together for a key to facilitate
cumulative calcs. So provided an individual key has all its values in a
single partition, I'm OK.

Additionally, the values will be written to the database, and from what I
have read, doing this at the partition level is the best compromise between
1) writing the calculated values for each key (lots of connects/disconnects)
and 2) collecting them all at the end and writing them all at once.

I am using a groupBy against the filtered RDD to get the grouping I want,
but apparently this may not be the most efficient way, and it seems that
everything is always in a single partition under this scenario.


___

*Mike Wright*
Principal Architect, Software Engineering

SNL Financial LC
434-951-7816 *p*
434-244-4466 *f*
540-470-0119 *m*

mwri...@snl.com

On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher 
wrote:

> That seems like it could work, although I don't think `partitionByKey` is
> a thing, at least for RDD. You might be able to merge step #2 and step #3
> into one step by using the `reduceByKey` function signature that takes in a
> Partitioner implementation.
>
> def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
> (Partitioner: <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html>,
> RDD: <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>)
>
> Merge the values for each key using an associative reduce function. This
> will also perform the merging locally on each mapper before sending results
> to a reducer, similarly to a "combiner" in MapReduce.
>
> The tricky part might be getting the partitioner to know about the number
> of partitions, which I think it needs to know upfront in `abstract def
> numPartitions: Int`. The `HashPartitioner` for example takes in the
> number as a constructor argument, maybe you could use that with an upper
> bound size if you don't mind empty partitions. Otherwise you might have to
> mess around to extract the exact number of keys if it's not readily
> available.
>
> Aside: what is the requirement to have each partition only contain the
> data related to one key?
>
> On Fri, Sep 4, 2015 at 11:06 AM, mmike87  wrote:
>
>> Hello, I am new to Apache Spark and this is my company's first Spark
>> project.
>> Essentially, we are calculating models dealing with Mining data using
>> Spark.
>>
>> I am holding all the source data in a persisted RDD that we will refresh
>> periodically. When a "scenario" is passed to the Spark job (we're using
>> Job
>> Server) the persisted RDD is filtered to the relevant mines. For example,
>> we
>> may want all mines in Chile and the 1990-2015 data for each.
>>
>> Many of the calculations are cumulative, that is when we apply user-input
>> "adjustment factors" to a value, we also need the "flexed" value we
>> calculated for that mine previously.
>>
>> To ensure that this works, the idea if to:
>>
>> 1) Filter the superset to relevant mines (done)
>> 2) Group the subset by the unique identifier for the mine. So, a group may
>> be all the rows for mine "A" for 1990-2015
>> 3) I then want to ensure that the RDD is partitioned by the Mine
>> Identifier
>> (and Integer).
>>
>> It's step 3 that is confusing me. I suspect it's very easy ... do I simply
>> use PartitionByKey?
>>
>> We're using Java if that makes any difference.
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>


Re: New to Spark - Partitioning Question

2015-09-08 Thread Richard Marscher
That seems like it could work, although I don't think `partitionByKey` is a
thing, at least for RDD. You might be able to merge step #2 and step #3
into one step by using the `reduceByKey` function signature that takes in a
Partitioner implementation.

def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
(Partitioner: <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html>,
RDD: <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>)

Merge the values for each key using an associative reduce function. This
will also perform the merging locally on each mapper before sending results
to a reducer, similarly to a "combiner" in MapReduce.

The tricky part might be getting the partitioner to know about the number
of partitions, which I think it needs to know upfront in `abstract def
numPartitions: Int`. The `HashPartitioner` for example takes in the number
as a constructor argument, maybe you could use that with an upper bound
size if you don't mind empty partitions. Otherwise you might have to mess
around to extract the exact number of keys if it's not readily available.
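
A rough illustration of that signature with toy data; the key type and the partition count of
64 are arbitrary:

import org.apache.spark.HashPartitioner

// toy pairs keyed by mine id (hypothetical data); 64 partitions is an arbitrary upper bound
val pairs = sc.parallelize(Seq((1, 2.0), (1, 3.0), (2, 4.0)))
val totals = pairs.reduceByKey(new HashPartitioner(64), (a: Double, b: Double) => a + b)
totals.collect()   // Array((1,5.0), (2,4.0)); local pre-aggregation happens before the shuffle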

Aside: what is the requirement to have each partition only contain the data
related to one key?

On Fri, Sep 4, 2015 at 11:06 AM, mmike87  wrote:

> Hello, I am new to Apache Spark and this is my company's first Spark
> project.
> Essentially, we are calculating models dealing with Mining data using
> Spark.
>
> I am holding all the source data in a persisted RDD that we will refresh
> periodically. When a "scenario" is passed to the Spark job (we're using Job
> Server) the persisted RDD is filtered to the relevant mines. For example,
> we
> may want all mines in Chile and the 1990-2015 data for each.
>
> Many of the calculations are cumulative, that is when we apply user-input
> "adjustment factors" to a value, we also need the "flexed" value we
> calculated for that mine previously.
>
> To ensure that this works, the idea if to:
>
> 1) Filter the superset to relevant mines (done)
> 2) Group the subset by the unique identifier for the mine. So, a group may
> be all the rows for mine "A" for 1990-2015
> 3) I then want to ensure that the RDD is partitioned by the Mine Identifier
> (and Integer).
>
> It's step 3 that is confusing me. I suspect it's very easy ... do I simply
> use PartitionByKey?
>
> We're using Java if that makes any difference.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>


New to Spark - Partitioning Question

2015-09-04 Thread mmike87
Hello, I am new to Apache Spark and this is my company's first Spark project.
Essentially, we are calculating models dealing with Mining data using Spark.

I am holding all the source data in a persisted RDD that we will refresh
periodically. When a "scenario" is passed to the Spark job (we're using Job
Server) the persisted RDD is filtered to the relevant mines. For example, we
may want all mines in Chile and the 1990-2015 data for each.

Many of the calculations are cumulative, that is when we apply user-input
"adjustment factors" to a value, we also need the "flexed" value we
calculated for that mine previously. 

To ensure that this works, the idea is to:

1) Filter the superset to the relevant mines (done)
2) Group the subset by the unique identifier for the mine. So, a group may
be all the rows for mine "A" for 1990-2015
3) I then want to ensure that the RDD is partitioned by the Mine Identifier
(an Integer).

It's step 3 that is confusing me. I suspect it's very easy ... do I simply
use PartitionByKey?

We're using Java if that makes any difference.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NEW to spark and sparksql

2014-11-20 Thread Michael Armbrust
I believe functions like sc.textFile will also accept paths with globs, for
example "/data/*/", which would read all the directories into a single RDD.
Under the covers I think it is just using Hadoop's FileInputFormat, in case
you want to google for the full list of supported syntax.
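
A rough illustration of the glob idea against the hourly layout Sam describes below; the HDFS
path is only an example, and Avro files would still go through an Avro-aware reader such as the
spark-avro library mentioned elsewhere in the thread:

// a glob in the path pulls every hour directory of the day into one RDD
val day = sc.textFile("hdfs:///data/2014/10/01/*/")
day.count()   // run analytics over the whole day in one go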

On Thu, Nov 20, 2014 at 7:27 AM, Sam Flint  wrote:

> So you are saying to query an entire day of data I would need to create
> one RDD for every hour and then union them into one RDD.  After I have the
> one RDD I would be able to query for a=2 throughout the entire day.
> Please correct me if I am wrong.
>
> Thanks
>
> On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust 
> wrote:
>
>> I would use just textFile unless you actually need a guarantee that you
>> will be seeing a whole file at time (textFile splits on new lines).
>>
>> RDDs are immutable, so you cannot add data to them.  You can however
>> union two RDDs, returning a new RDD that contains all the data.
>>
>> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint 
>> wrote:
>>
>>> Michael,
>>> Thanks for your help.   I found a wholeTextFiles() that I can use to
>>> import all files in a directory.  I believe this would be the case if all
>>> the files existed in the same directory.  Currently the files come in by
>>> the hour and are in a format somewhat like this ../2014/10/01/00/filename
>>> and there is one file per hour.
>>>
>>> Do I create an RDD and add to it? Is that possible?  My example query
>>> would be select count(*) from (entire day RDD) where a=2.  "a" would exist
>>> in all files multiple times with multiple values.
>>>
>>> I don't see in any documentation how to import a file create an RDD then
>>> import another file into that RDD.   kinda like in mysql when you create a
>>> table import data then import more data.  This may be my ignorance because
>>> I am not that familiar with spark, but I would expect to import data into a
>>> single RDD before performing analytics on it.
>>>
>>> Thank you for your time and help on this.
>>>
>>>
>>> P.S. I am using python if that makes a difference.
>>>
>>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> In general you should be able to read full directories of files as a
>>>> single RDD/SchemaRDD.  For documentation I'd suggest the programming
>>>> guides:
>>>>
>>>> http://spark.apache.org/docs/latest/quick-start.html
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>>>
>>>> For Avro in particular, I have been working on a library for Spark
>>>> SQL.  Its very early code, but you can find it here:
>>>> https://github.com/databricks/spark-avro
>>>>
>>>> Bug reports welcome!
>>>>
>>>> Michael
>>>>
>>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am new to spark.  I have began to read to understand sparks RDD
>>>>> files as well as SparkSQL.  My question is more on how to build out the 
>>>>> RDD
>>>>> files and best practices.   I have data that is broken down by hour into
>>>>> files on HDFS in avro format.   Do I need to create a separate RDD for 
>>>>> each
>>>>> file? or using SparkSQL a separate SchemaRDD?
>>>>>
>>>>> I want to be able to pull lets say an entire day of data into spark
>>>>> and run some analytics on it.  Then possibly a week, a month, etc.
>>>>>
>>>>>
>>>>> If there is documentation on this procedure or best practives for
>>>>> building RDD's please point me at them.
>>>>>
>>>>> Thanks for your time,
>>>>>Sam
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *MAGNE**+**I**C*
>>>
>>> *Sam Flint* | *Lead Developer, Data Analytics*
>>>
>>>
>>>
>>
>
>
> --
>
> *MAGNE**+**I**C*
>
> *Sam Flint* | *Lead Developer, Data Analytics*
>
>
>


Re: NEW to spark and sparksql

2014-11-20 Thread Sam Flint
So you are saying that to query an entire day of data I would need to create
one RDD for every hour and then union them into one RDD.  After I have the one
RDD I would be able to query for a=2 throughout the entire day.  Please
correct me if I am wrong.

Thanks

On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust 
wrote:

> I would use just textFile unless you actually need a guarantee that you
> will be seeing a whole file at time (textFile splits on new lines).
>
> RDDs are immutable, so you cannot add data to them.  You can however union
> two RDDs, returning a new RDD that contains all the data.
>
> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint  wrote:
>
>> Michael,
>> Thanks for your help.   I found a wholeTextFiles() that I can use to
>> import all files in a directory.  I believe this would be the case if all
>> the files existed in the same directory.  Currently the files come in by
>> the hour and are in a format somewhat like this ../2014/10/01/00/filename
>> and there is one file per hour.
>>
>> Do I create an RDD and add to it? Is that possible?  My example query
>> would be select count(*) from (entire day RDD) where a=2.  "a" would exist
>> in all files multiple times with multiple values.
>>
>> I don't see in any documentation how to import a file create an RDD then
>> import another file into that RDD.   kinda like in mysql when you create a
>> table import data then import more data.  This may be my ignorance because
>> I am not that familiar with spark, but I would expect to import data into a
>> single RDD before performing analytics on it.
>>
>> Thank you for your time and help on this.
>>
>>
>> P.S. I am using python if that makes a difference.
>>
>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust > > wrote:
>>
>>> In general you should be able to read full directories of files as a
>>> single RDD/SchemaRDD.  For documentation I'd suggest the programming
>>> guides:
>>>
>>> http://spark.apache.org/docs/latest/quick-start.html
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>>
>>> For Avro in particular, I have been working on a library for Spark SQL.
>>> Its very early code, but you can find it here:
>>> https://github.com/databricks/spark-avro
>>>
>>> Bug reports welcome!
>>>
>>> Michael
>>>
>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to spark.  I have began to read to understand sparks RDD
>>>> files as well as SparkSQL.  My question is more on how to build out the RDD
>>>> files and best practices.   I have data that is broken down by hour into
>>>> files on HDFS in avro format.   Do I need to create a separate RDD for each
>>>> file? or using SparkSQL a separate SchemaRDD?
>>>>
>>>> I want to be able to pull lets say an entire day of data into spark and
>>>> run some analytics on it.  Then possibly a week, a month, etc.
>>>>
>>>>
>>>> If there is documentation on this procedure or best practives for
>>>> building RDD's please point me at them.
>>>>
>>>> Thanks for your time,
>>>>Sam
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> *MAGNE**+**I**C*
>>
>> *Sam Flint* | *Lead Developer, Data Analytics*
>>
>>
>>
>


-- 

*MAGNE**+**I**C*

*Sam Flint* | *Lead Developer, Data Analytics*


Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
I would use just textFile unless you actually need a guarantee that you
will be seeing a whole file at a time (textFile splits on new lines).

RDDs are immutable, so you cannot add data to them.  You can however union
two RDDs, returning a new RDD that contains all the data.
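
A rough two-line illustration of that union point; the paths are placeholders:

val hour00 = sc.textFile("hdfs:///data/2014/10/01/00/")
val hour01 = sc.textFile("hdfs:///data/2014/10/01/01/")
val both = hour00.union(hour01)   // a new RDD over all the data; neither input is modified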

On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint  wrote:

> Michael,
> Thanks for your help.   I found a wholeTextFiles() that I can use to
> import all files in a directory.  I believe this would be the case if all
> the files existed in the same directory.  Currently the files come in by
> the hour and are in a format somewhat like this ../2014/10/01/00/filename
> and there is one file per hour.
>
> Do I create an RDD and add to it? Is that possible?  My example query
> would be select count(*) from (entire day RDD) where a=2.  "a" would exist
> in all files multiple times with multiple values.
>
> I don't see in any documentation how to import a file create an RDD then
> import another file into that RDD.   kinda like in mysql when you create a
> table import data then import more data.  This may be my ignorance because
> I am not that familiar with spark, but I would expect to import data into a
> single RDD before performing analytics on it.
>
> Thank you for your time and help on this.
>
>
> P.S. I am using python if that makes a difference.
>
> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust 
> wrote:
>
>> In general you should be able to read full directories of files as a
>> single RDD/SchemaRDD.  For documentation I'd suggest the programming
>> guides:
>>
>> http://spark.apache.org/docs/latest/quick-start.html
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> For Avro in particular, I have been working on a library for Spark SQL.
>> Its very early code, but you can find it here:
>> https://github.com/databricks/spark-avro
>>
>> Bug reports welcome!
>>
>> Michael
>>
>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint 
>> wrote:
>>
>>> Hi,
>>>
>>> I am new to spark.  I have began to read to understand sparks RDD
>>> files as well as SparkSQL.  My question is more on how to build out the RDD
>>> files and best practices.   I have data that is broken down by hour into
>>> files on HDFS in avro format.   Do I need to create a separate RDD for each
>>> file? or using SparkSQL a separate SchemaRDD?
>>>
>>> I want to be able to pull lets say an entire day of data into spark and
>>> run some analytics on it.  Then possibly a week, a month, etc.
>>>
>>>
>>> If there is documentation on this procedure or best practives for
>>> building RDD's please point me at them.
>>>
>>> Thanks for your time,
>>>Sam
>>>
>>>
>>>
>>>
>>
>
>
> --
>
> *MAGNE**+**I**C*
>
> *Sam Flint* | *Lead Developer, Data Analytics*
>
>
>


Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
In general you should be able to read full directories of files as a single
RDD/SchemaRDD.  For documentation I'd suggest the programming guides:

http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/sql-programming-guide.html

For Avro in particular, I have been working on a library for Spark SQL.
Its very early code, but you can find it here:
https://github.com/databricks/spark-avro

Bug reports welcome!

Michael

On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint  wrote:

> Hi,
>
>     I am new to spark.  I have began to read to understand sparks RDD
> files as well as SparkSQL.  My question is more on how to build out the RDD
> files and best practices.   I have data that is broken down by hour into
> files on HDFS in avro format.   Do I need to create a separate RDD for each
> file? or using SparkSQL a separate SchemaRDD?
>
> I want to be able to pull lets say an entire day of data into spark and
> run some analytics on it.  Then possibly a week, a month, etc.
>
>
> If there is documentation on this procedure or best practives for building
> RDD's please point me at them.
>
> Thanks for your time,
>Sam
>
>
>
>


NEW to spark and sparksql

2014-11-19 Thread Sam Flint
Hi,

I am new to Spark.  I have begun reading to understand Spark's RDDs
as well as SparkSQL.  My question is more about how to build out the RDDs
and best practices.  I have data that is broken down by hour into files on
HDFS in Avro format.  Do I need to create a separate RDD for each file? Or,
using SparkSQL, a separate SchemaRDD?

I want to be able to pull, let's say, an entire day of data into Spark and run
some analytics on it.  Then possibly a week, a month, etc.


If there is documentation on this procedure or best practices for building
RDDs, please point me at them.

Thanks for your time,
   Sam