New to Spark 2.2.1 - Problem finding tables across different metastore DBs
All, I am new to Spark 2.2.1. I have a single-node cluster and have also enabled the Thrift server so my Tableau application can connect to my persisted table. I suspect that the Spark cluster metastore is different from the Thrift server metastore. If this assumption is valid, what do I need to do to make it a single, shared metastore? Once I have fixed this issue, I will enable another node in my cluster. Thanks, Subhajit
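This symptom usually means each process is spinning up its own embedded Derby metastore because it cannot find a hive-site.xml. A hedged sketch of one way to unify them (the paths below are assumptions, not taken from this thread): copy the hive-site.xml that points at your real metastore into Spark's conf directory, then restart the Thrift server so both it and your Spark jobs resolve the same metastore.

```shell
# Sketch only; paths are assumptions for a typical layout.
cp /etc/hive/conf/hive-site.xml "$SPARK_HOME/conf/"

# Restart the JDBC/ODBC Thrift server so it picks up the shared config:
"$SPARK_HOME/sbin/stop-thriftserver.sh"
"$SPARK_HOME/sbin/start-thriftserver.sh"
```

With no hive-site.xml on the classpath, Spark creates a local Derby metastore_db in the working directory, which is how two processes end up seeing different sets of tables.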
Re: New to Spark.
Hi Anirudh, All types of contributions are welcome, from code to documentation. Please check out the page at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for some info; specifically, keep an eye out for starter JIRAs here https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) . On Wed, Sep 28, 2016 at 9:11 AM, Anirudh Muhnot wrote: > Hello everyone, I'm Anirudh. I'm fairly new to Spark, as I've done an > online specialisation from UC Berkeley. I know how to code in Python but > have little to no idea about Scala. I want to contribute to Spark. Where do > I start, and how? I'm reading the pull requests on GitHub but I'm barely > able to understand them. Can anyone help? Thank you. > Sent from my iPhone >
New to Spark.
Hello everyone, I'm Anirudh. I'm fairly new to Spark, as I've done an online specialisation from UC Berkeley. I know how to code in Python but have little to no idea about Scala. I want to contribute to Spark. Where do I start, and how? I'm reading the pull requests on GitHub but I'm barely able to understand them. Can anyone help? Thank you. Sent from my iPhone
Re: New to Spark - trying to get a basic example to run - could use some help
Maybe a comment should be added to SparkPi.scala, telling the user to look for the value in the stdout log? Cheers On Sat, Feb 13, 2016 at 3:12 AM, Chandeep Singh wrote: > Try looking at the stdout logs. I ran the exact same job as you and did not > see anything on the console either, but found it in stdout. > > [csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --name RT_SparkPi > /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar >10 > > Log Type: stdout > > Log Upload Time: Sat Feb 13 11:00:08 + 2016 > > Log Length: 23 > > Pi is roughly 3.140224 > > > Hope that helps! > > > On Sat, Feb 13, 2016 at 3:14 AM, Taylor, Ronald C > wrote: > >> Hello folks, >> >> This is my first msg to the list. New to Spark, and trying to run the >> SparkPi example shown in the Cloudera documentation. We have Cloudera >> 5.5.1 running on a small cluster at our lab, with Spark 1.5. >> >> My trial invocation is given below. The output that I get **says** that >> I “SUCCEEDED” at the end. But – I don’t get any screen output on the value >> of pi. I also tried a SecondarySort Spark program that I compiled and >> jarred from Dr. Parsian’s Data Algorithms book. That program failed. So – >> I am focusing on getting SparkPi to work properly, to get started. Can >> somebody look at the screen output that I cut-and-pasted below and infer >> what I might be doing wrong? >> >> Am I forgetting to set one or more environment variables? Or not setting >> such properly? 
>> >> Here is the CLASSPATH value that I set: >> >> >> CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils >> >> Here is the settings of other environment variables: >> >> HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11 >> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11 >> >> HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar' >> >> SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils' >> >> I am not sure that those env vars are properly set (or if even all of >> them are needed). But that’s what I’m currently using. >> >> As I said, the invocation below appears to terminate with final status >> set to “SUCCEEDED”. But – there is no screen output on the value of pi, >> which I understood would be shown. So – something appears to be going >> wrong. I went to the tracking URL given at the end, but could not access it. >> >> I would very much appreciate some guidance! >> >> >>- Ron Taylor >> >> >> % >> >> INVOCATION: >> >> [rtaylor@bigdatann]$ spark-submit --class >> org.apache.spark.examples.SparkPi--master yarn--deploy-mode >> cluster--name RT_SparkPi >> /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar >>10 >> >> SLF4J: Class path contains multiple SLF4J bindings. 
>> SLF4J: Found binding in >> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] >> SLF4J: Found binding in >> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] >> SLF4J: Found binding in >> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class] >> SLF4J: Found binding in >> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] >> SLF4J: Found binding in >> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an >> explanation. >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] >> 16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at >> bigdatann.ib/172.17.115.18:8032 >> 16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from >> cluster with 15 NodeManagers >> 16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not >> requested more than the maximum memory capability of the cluster (65536 MB >> per container) >> 16/02/12
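As the replies note, with --deploy-mode cluster the driver runs inside the YARN application master, so driver output such as "Pi is roughly ..." goes to that container's stdout rather than the submitting console. With YARN log aggregation enabled, it can be retrieved after the job finishes; a hedged sketch using the application id that appears in the log excerpt above:

```shell
# Pull the aggregated container logs and look for the driver's println output:
yarn logs -applicationId application_1454115464826_0070 | grep "Pi is roughly"
```

Alternatively, submitting with --deploy-mode client keeps the driver, and therefore its stdout, on the launching machine.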
Re: New to Spark - trying to get a basic example to run - could use some help
Try looking at the stdout logs. I ran the exact same job as you and did not see anything on the console either, but found it in stdout. [csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --name RT_SparkPi /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10 Log Type: stdout Log Upload Time: Sat Feb 13 11:00:08 + 2016 Log Length: 23 Pi is roughly 3.140224 Hope that helps! On Sat, Feb 13, 2016 at 3:14 AM, Taylor, Ronald C wrote: > Hello folks, > > This is my first msg to the list. New to Spark, and trying to run the > SparkPi example shown in the Cloudera documentation. We have Cloudera > 5.5.1 running on a small cluster at our lab, with Spark 1.5. > > My trial invocation is given below. The output that I get **says** that I > “SUCCEEDED” at the end. But – I don’t get any screen output on the value of > pi. I also tried a SecondarySort Spark program that I compiled and jarred > from Dr. Parsian’s Data Algorithms book. That program failed. So – I am > focusing on getting SparkPi to work properly, to get started. Can somebody > look at the screen output that I cut-and-pasted below and infer what I > might be doing wrong? > > Am I forgetting to set one or more environment variables? Or not setting > such properly? 
> > Here is the CLASSPATH value that I set: > > > CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils > > Here are the settings of the other environment variables: > > HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11 > SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11 > > HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar' > > SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils' > > I am not sure that those env vars are properly set (or if even all of them > are needed). But that’s what I’m currently using. > > As I said, the invocation below appears to terminate with final status set > to “SUCCEEDED”. But – there is no screen output on the value of pi, which I > understood would be shown. So – something appears to be going wrong. I went > to the tracking URL given at the end, but could not access it. > > I would very much appreciate some guidance! > > >- Ron Taylor > > > % > > INVOCATION: > > [rtaylor@bigdatann]$ spark-submit --class > org.apache.spark.examples.SparkPi --master yarn --deploy-mode > cluster --name RT_SparkPi > /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar >10 > > SLF4J: Class path contains multiple SLF4J bindings. 
> SLF4J: Found binding in > [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at > bigdatann.ib/172.17.115.18:8032 > 16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from > cluster with 15 NodeManagers > 16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not > requested more than the maximum memory capability of the cluster (65536 MB > per container) > 16/02/12 18:16:59 INFO yarn.Client: Will allocate AM container, with 1408 > MB memory including 384 MB overhead > 16/02/12 18:16:59 INFO yarn.Client: Setting up container launch context > for our AM > 16/02/12 18:16:59 INFO yarn.Client: Setting up the launch environment for > our AM container > 16/02/12 18:16:59 INFO yarn.Client: Preparing resources for our AM > container > 16/02/12 18:17:00 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 16/02/12 18:17:00 INFO yarn.Client:
New to Spark - trying to get a basic example to run - could use some help
Hello folks, This is my first msg to the list. New to Spark, and trying to run the SparkPi example shown in the Cloudera documentation. We have Cloudera 5.5.1 running on a small cluster at our lab, with Spark 1.5. My trial invocation is given below. The output that I get *says* that I "SUCCEEDED" at the end. But - I don't get any screen output on the value of pi. I also tried a SecondarySort Spark program that I compiled and jarred from Dr. Parsian's Data Algorithms book. That program failed. So - I am focusing on getting SparkPi to work properly, to get started. Can somebody look at the screen output that I cut-and-pasted below and infer what I might be doing wrong? Am I forgetting to set one or more environment variables? Or not setting them properly? Here is the CLASSPATH value that I set: CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils Here are the settings of the other environment variables: HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11 SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11 HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar' SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils' I am not sure that those env vars are properly set (or if even all of them are needed). But that's what I'm currently using. As I said, the invocation below appears to terminate with final status set to "SUCCEEDED". But - there is no screen output on the value of pi, which I understood would be shown. So - something appears to be going wrong. I went to the tracking URL given at the end, but could not access it. I would very much appreciate some guidance! 
- Ron Taylor % INVOCATION: [rtaylor@bigdatann]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --name RT_SparkPi /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at bigdatann.ib/172.17.115.18:8032 16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from cluster with 15 NodeManagers 16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (65536 MB per container) 16/02/12 18:16:59 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead 16/02/12 18:16:59 INFO yarn.Client: Setting up container launch context for our AM 16/02/12 18:16:59 INFO yarn.Client: Setting up the launch environment for our AM container 16/02/12 18:16:59 INFO yarn.Client: Preparing resources for our AM container 16/02/12 18:17:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/02/12 18:17:00 INFO yarn.Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 16/02/12 18:17:21 INFO yarn.Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 16/02/12 18:17:23 INFO yarn.Client: Uploading resource file:/tmp/spark-141bf8a4-2f4b-49d3-b041-61070107e4de/__spark_conf__8357851336386157291.zip -> hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/__spark_conf__8357851336386157291.zip 16/02/12 18:17:23 INFO spark.SecurityManag
Re: New to Spark
Have you tried the following command? REFRESH TABLE Cheers On Tue, Dec 1, 2015 at 1:54 AM, Ashok Kumar wrote: > Hi, > > I am new to Spark. > > I am trying to use spark-sql with Spark-created and Hive-created tables. > > I have successfully made the Hive metastore usable by Spark. > > In spark-sql I can see the DDL for Hive tables. However, when I do select > count(1) from HIVE_TABLE it always returns zero rows. > > If I create a table in Spark as create table SPARK_TABLE as select * from > HIVE_TABLE, the table schema is created but no data. > > I can then use Hive to do INSERT SELECT from HIVE_TABLE into SPARK_TABLE. > That works. > > I can then use spark-sql to query the table. > > My questions: > > >1. Is it correct that spark-sql only sees data in Spark-created >tables but not any data in Hive tables? >2. How can I make Spark read data from existing Hive tables? > > > > Thanks >
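A hedged sketch of the suggestion above, typed at the spark-sql prompt (HIVE_TABLE is the poster's placeholder name, not a real table):

```sql
-- Re-read the table's cached metadata and data files, then retry the count:
REFRESH TABLE HIVE_TABLE;
SELECT COUNT(1) FROM HIVE_TABLE;
```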
Re: New to Spark
Hi there, Which Spark version are you using? You made the Hive metastore available to Spark, which means you can run SQL queries over the current Hive tables, right? Or are you just using the local Hive metastore embedded on the Spark SQL side? I think you need to provide more info on your Spark SQL and Hive config; that would help locate the root cause of the problem. Best, Sun. fightf...@163.com From: Ashok Kumar Date: 2015-12-01 18:54 To: user@spark.apache.org Subject: New to Spark Hi, I am new to Spark. I am trying to use spark-sql with Spark-created and Hive-created tables. I have successfully made the Hive metastore usable by Spark. In spark-sql I can see the DDL for Hive tables. However, when I do select count(1) from HIVE_TABLE it always returns zero rows. If I create a table in Spark as create table SPARK_TABLE as select * from HIVE_TABLE, the table schema is created but no data. I can then use Hive to do INSERT SELECT from HIVE_TABLE into SPARK_TABLE. That works. I can then use spark-sql to query the table. My questions: Is it correct that spark-sql only sees data in Spark-created tables but not any data in Hive tables? How can I make Spark read data from existing Hive tables? Thanks
New to Spark
Hi, I am new to Spark. I am trying to use spark-sql with Spark-created and Hive-created tables. I have successfully made the Hive metastore usable by Spark. In spark-sql I can see the DDL for Hive tables. However, when I do select count(1) from HIVE_TABLE it always returns zero rows. If I create a table in Spark as create table SPARK_TABLE as select * from HIVE_TABLE, the table schema is created but no data. I can then use Hive to do INSERT SELECT from HIVE_TABLE into SPARK_TABLE. That works. I can then use spark-sql to query the table. My questions: - Is it correct that spark-sql only sees data in Spark-created tables but not any data in Hive tables? - How can I make Spark read data from existing Hive tables? Thanks
Re: New to Spark - Paritioning Question
Ah I see. In that case, the groupByKey function does guarantee every key is on exactly one partition matched with the aggregated data. This can be improved depending on what you want to do after. Group by key only aggregates the data after shipping it across the cluster. Meanwhile, using reduceByKey will do aggregation on each node first, then ship those results to the final node and partition to finalize the aggregation there. If that makes sense. So say Node 1 has pairs: (a, 1), (b, 2), (b, 6) Node 2 has pairs: (a, 2), (a, 3), (b, 4) groupByKey would send both the a pairs and the b pairs across the network. If you did reduce with the aggregate of sum then you'd expect it to ship (b, 8) from Node 1 or (a, 5) from Node 2 since it did the local aggregation first. You are correct that doing something with expensive side-effects like writing to a database (connections and network + I/O) is best done with the mapPartitions or foreachPartition type of functions on RDD so you can share a database connection and also potentially do things like batch statements. On Tue, Sep 8, 2015 at 7:37 PM, Mike Wright wrote: > Thanks for the response! > > Well, in retrospect each partition doesn't need to be restricted to a > single key. But, I cannot have values associated with a key span partitions > since they all need to be processed together for a key to facilitate > cumulative calcs. So provided an individual key has all its values in a > single partition, I'm OK. > > Additionally, the values will be written to the database, and from what I > have read doing this at the partition level is the best compromise between > 1) Writing the calculated values for each key (lots of connect/disconnects) > and 2) collecting them all at the end and writing them all at once. > > I am using a groupBy against the filtered RDD to get the grouping I want, > but apparently this may not be the most efficient way, and it seems that > everything is always in a single partition under this scenario. 
> > > ___ > > *Mike Wright* > Principal Architect, Software Engineering > > SNL Financial LC > 434-951-7816 *p* > 434-244-4466 *f* > 540-470-0119 *m* > > mwri...@snl.com > > On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher > wrote: > >> That seems like it could work, although I don't think `partitionByKey` is >> a thing, at least for RDD. You might be able to merge step #2 and step #3 >> into one step by using the `reduceByKey` function signature that takes in a >> Partitioner implementation. >> >> def reduceByKey(partitioner: Partitioner >> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html> >> , func: (V, V) ⇒ V): RDD >> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html> >> [(K, V)] >> >> Merge the values for each key using an associative reduce function. This >> will also perform the merging locally on each mapper before sending results >> to a reducer, similarly to a "combiner" in MapReduce. >> >> The tricky part might be getting the partitioner to know about the number >> of partitions, which I think it needs to know upfront in `abstract def >> numPartitions: Int`. The `HashPartitioner` for example takes in the >> number as a constructor argument, maybe you could use that with an upper >> bound size if you don't mind empty partitions. Otherwise you might have to >> mess around to extract the exact number of keys if it's not readily >> available. >> >> Aside: what is the requirement to have each partition only contain the >> data related to one key? >> >> On Fri, Sep 4, 2015 at 11:06 AM, mmike87 wrote: >> >>> Hello, I am new to Apache Spark and this is my company's first Spark >>> project. >>> Essentially, we are calculating models dealing with Mining data using >>> Spark. >>> >>> I am holding all the source data in a persisted RDD that we will refresh >>> periodically. When a "scenario" is passed to the Spark job (we're using >>> Job >>> Server) the persisted RDD is filtered to the relevant mines. 
For >>> example, we >>> may want all mines in Chile and the 1990-2015 data for each. >>> >>> Many of the calculations are cumulative, that is when we apply user-input >>> "adjustment factors" to a value, we also need the "flexed" value we >>> calculated for that mine previously. >>> >>> To ensure that this works, the idea is to: >>> >>> 1) Filter the superset to relevant mines (done) >>> 2) Group the subset by the unique identifier for the mine. So, a gro
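The shuffle-size difference described in the reply above can be sketched outside Spark. Below is a plain-Python simulation of reduceByKey's map-side combine using the thread's Node 1 / Node 2 pairs; the function name is illustrative, not Spark API.

```python
from collections import defaultdict

def combine_locally(pairs):
    """Simulate reduceByKey's map-side combine: sum values per key on one node."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

node1 = [("a", 1), ("b", 2), ("b", 6)]
node2 = [("a", 2), ("a", 3), ("b", 4)]

# groupByKey ships every pair across the network: 6 records.
shipped_group_by = len(node1) + len(node2)

# reduceByKey combines locally first, then ships one record per key per node.
local1 = combine_locally(node1)   # {"a": 1, "b": 8}  -- the (b, 8) from the reply
local2 = combine_locally(node2)   # {"a": 5, "b": 4}  -- the (a, 5) from the reply
shipped_reduce_by = len(local1) + len(local2)  # 4 records

# Final merge on the reducer side:
final = combine_locally(list(local1.items()) + list(local2.items()))
print(shipped_group_by, shipped_reduce_by, final)
```

The saving grows with the number of duplicate keys per node, which is why reduceByKey is preferred when the aggregation is associative.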
Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o
Prachicsa: If the number of EC tokens is high, please consider using a set instead of an array for better lookup performance. BTW, use a short, descriptive subject for future emails. > On Sep 9, 2015, at 3:13 AM, Akhil Das wrote: > > Try this: > > val tocks = Array("EC-17A5206955089011B","EC-17A5206955089011A") > > val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B","This > doesnt")) > > rddAll.filter(line => { > var found = false > for(item <- tocks){ >if(line.contains(item)) found = true > } >found > }).collect() > > > Output: > res8: Array[String] = Array(This contains EC-17A5206955089011B) > > Thanks > Best Regards > >> On Wed, Sep 9, 2015 at 3:25 PM, prachicsa wrote: >> >> >> I am very new to Spark. >> >> I have a very basic question. I have an array of values: >> >> listofECtokens: Array[String] = Array(EC-17A5206955089011B, >> EC-17A5206955089011A) >> >> I want to filter an RDD for all of these token values. I tried the following >> way: >> >> val ECtokens = for (token <- listofECtokens) rddAll.filter(line => >> line.contains(token)) >> >> Output: >> >> ECtokens: Unit = () >> >> I got an empty Unit even when there are records with these tokens. What am I >> doing wrong? >> >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/I-am-very-new-to-Spark-I-have-a-very-basic-question-I-have-an-array-of-values-listofECtokens-Array-S-tp24617.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >
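The poster's snippet returned Unit because a Scala for-comprehension without yield is a statement, so each filtered RDD was built and discarded. Akhil's filter fixes that; here is the same any-token filter sketched in plain Python, since the Scala version needs a Spark shell (in Scala, one idiomatic equivalent of the loop would be rddAll.filter(line => tocks.exists(line.contains))).

```python
tokens = {"EC-17A5206955089011B", "EC-17A5206955089011A"}

lines = ["This contains EC-17A5206955089011B", "This doesnt"]

# Keep a line if any token occurs in it. Note Ted's set advice speeds up
# exact-match membership tests; for substring search like this, each token
# is still scanned per line, so a set mainly deduplicates the token list.
matches = [line for line in lines if any(tok in line for tok in tokens)]
print(matches)  # ["This contains EC-17A5206955089011B"]
```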
Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o
Try this: val tocks = Array("EC-17A5206955089011B","EC-17A5206955089011A") val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B","This doesnt")) rddAll.filter(line => { var found = false for(item <- tocks){ if(line.contains(item)) found = true } found }).collect() Output: res8: Array[String] = Array(This contains EC-17A5206955089011B) Thanks Best Regards On Wed, Sep 9, 2015 at 3:25 PM, prachicsa wrote: > > > I am very new to Spark. > > I have a very basic question. I have an array of values: > > listofECtokens: Array[String] = Array(EC-17A5206955089011B, > EC-17A5206955089011A) > > I want to filter an RDD for all of these token values. I tried the > following > way: > > val ECtokens = for (token <- listofECtokens) rddAll.filter(line => > line.contains(token)) > > Output: > > ECtokens: Unit = () > > I got an empty Unit even when there are records with these tokens. What am > I > doing wrong? > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/I-am-very-new-to-Spark-I-have-a-very-basic-question-I-have-an-array-of-values-listofECtokens-Array-S-tp24617.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of
I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all of these token values. I tried the following way: val ECtokens = for (token <- listofECtokens) rddAll.filter(line => line.contains(token)) Output: ECtokens: Unit = () I got an empty Unit even when there are records with these tokens. What am I doing wrong?
Re: New to Spark - Paritioning Question
Thanks for the response! Well, in retrospect each partition doesn't need to be restricted to a single key. But I cannot have the values associated with a key span partitions, since they all need to be processed together for a key to facilitate cumulative calcs. So provided an individual key has all its values in a single partition, I'm OK. Additionally, the values will be written to the database, and from what I have read doing this at the partition level is the best compromise between 1) writing the calculated values for each key (lots of connect/disconnects) and 2) collecting them all at the end and writing them all at once. I am using a groupBy against the filtered RDD to get the grouping I want, but apparently this may not be the most efficient way, and it seems that everything is always in a single partition under this scenario. ___ *Mike Wright* Principal Architect, Software Engineering SNL Financial LC 434-951-7816 *p* 434-244-4466 *f* 540-470-0119 *m* mwri...@snl.com On Tue, Sep 8, 2015 at 5:38 PM, Richard Marscher wrote: > That seems like it could work, although I don't think `partitionByKey` is > a thing, at least for RDD. You might be able to merge step #2 and step #3 > into one step by using the `reduceByKey` function signature that takes in a > Partitioner implementation. > > def reduceByKey(partitioner: Partitioner > <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html> > , func: (V, V) ⇒ V): RDD > <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html> > [(K, V)] > > Merge the values for each key using an associative reduce function. This > will also perform the merging locally on each mapper before sending results > to a reducer, similarly to a "combiner" in MapReduce. > > The tricky part might be getting the partitioner to know about the number > of partitions, which I think it needs to know upfront in `abstract def > numPartitions: Int`. 
The `HashPartitioner` for example takes in the > number as a constructor argument, maybe you could use that with an upper > bound size if you don't mind empty partitions. Otherwise you might have to > mess around to extract the exact number of keys if it's not readily > available. > > Aside: what is the requirement to have each partition only contain the > data related to one key? > > On Fri, Sep 4, 2015 at 11:06 AM, mmike87 wrote: > >> Hello, I am new to Apache Spark and this is my company's first Spark >> project. >> Essentially, we are calculating models dealing with Mining data using >> Spark. >> >> I am holding all the source data in a persisted RDD that we will refresh >> periodically. When a "scenario" is passed to the Spark job (we're using >> Job >> Server) the persisted RDD is filtered to the relevant mines. For example, >> we >> may want all mines in Chile and the 1990-2015 data for each. >> >> Many of the calculations are cumulative, that is when we apply user-input >> "adjustment factors" to a value, we also need the "flexed" value we >> calculated for that mine previously. >> >> To ensure that this works, the idea is to: >> >> 1) Filter the superset to relevant mines (done) >> 2) Group the subset by the unique identifier for the mine. So, a group may >> be all the rows for mine "A" for 1990-2015 >> 3) I then want to ensure that the RDD is partitioned by the Mine >> Identifier >> (an Integer). >> >> It's step 3 that is confusing me. I suspect it's very easy ... do I simply >> use PartitionByKey? >> >> We're using Java if that makes any difference. >> >> Thanks! >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. 
>> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > > > -- > *Richard Marscher* > Software Engineer > Localytics > Localytics.com <http://localytics.com/> | Our Blog > <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | > Facebook <http://facebook.com/localytics> | LinkedIn > <http://www.linkedin.com/company/1148792?trk=tyah> >
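The batched, per-partition database write pattern discussed in this thread can be sketched in plain Python. This is illustrative only, not Spark or JDBC API: `FakeConnection` and `write_batch` are made-up stand-ins, and in real Spark code the outer loop would be `rdd.foreachPartition`.

```python
# Sketch of the per-partition write pattern: one connection per partition,
# one batched write per partition -- the compromise between per-record
# writes (many connects/disconnects) and one giant collect() at the end.

class FakeConnection:
    """Hypothetical DB connection that just records batched writes."""
    opened = 0  # counts how many connections were created in total

    def __init__(self):
        FakeConnection.opened += 1
        self.rows = []

    def write_batch(self, rows):
        self.rows.extend(rows)

    def close(self):
        pass

def write_per_partition(partitions):
    """Analogous to rdd.foreachPartition: connect once per partition,
    write the whole partition as a single batch, then disconnect."""
    written = []
    for partition in partitions:
        conn = FakeConnection()      # connect once per partition
        conn.write_batch(partition)  # batch-write every row in it
        conn.close()
        written.extend(conn.rows)
    return written

# Two partitions, each holding every row for a single mine key.
partitions = [[("A", 1990, 10.0), ("A", 1991, 12.5)],
              [("B", 1990, 7.0)]]
rows = write_per_partition(partitions)
print(FakeConnection.opened)  # 2 connections, not one per row
print(len(rows))              # 3
```

With three rows this opens two connections instead of three; at real data volumes the gap between per-partition and per-record connection counts is what makes the pattern worthwhile.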
Re: New to Spark - Partitioning Question
That seems like it could work, although I don't think `partitionByKey` is a thing, at least for RDD. You might be able to merge step #2 and step #3 into one step by using the `reduceByKey` function signature that takes in a Partitioner implementation. def reduceByKey(partitioner: Partitioner <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html> , func: (V, V) ⇒ V): RDD <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html> [(K, V)] Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. The tricky part might be getting the partitioner to know about the number of partitions, which I think it needs to know upfront in `abstract def numPartitions: Int`. The `HashPartitioner` for example takes in the number as a constructor argument, maybe you could use that with an upper bound size if you don't mind empty partitions. Otherwise you might have to mess around to extract the exact number of keys if it's not readily available. Aside: what is the requirement to have each partition only contain the data related to one key? On Fri, Sep 4, 2015 at 11:06 AM, mmike87 wrote: > Hello, I am new to Apache Spark and this is my company's first Spark > project. > Essentially, we are calculating models dealing with Mining data using > Spark. > > I am holding all the source data in a persisted RDD that we will refresh > periodically. When a "scenario" is passed to the Spark job (we're using Job > Server) the persisted RDD is filtered to the relevant mines. For example, > we > may want all mines in Chile and the 1990-2015 data for each. > > Many of the calculations are cumulative, that is when we apply user-input > "adjustment factors" to a value, we also need the "flexed" value we > calculated for that mine previously. 
> > To ensure that this works, the idea if to: > > 1) Filter the superset to relevant mines (done) > 2) Group the subset by the unique identifier for the mine. So, a group may > be all the rows for mine "A" for 1990-2015 > 3) I then want to ensure that the RDD is partitioned by the Mine Identifier > (and Integer). > > It's step 3 that is confusing me. I suspect it's very easy ... do I simply > use PartitionByKey? > > We're using Java if that makes any difference. > > Thanks! > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- *Richard Marscher* Software Engineer Localytics Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>
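The hash-partitioning idea behind the `HashPartitioner` suggestion above can be illustrated in plain Python. This is a sketch of what a hash partitioner does, not the Spark API itself: each pair is routed to partition `hash(key) % numPartitions`, so all values for a given key land in exactly one partition, which is the guarantee the cumulative per-mine calculations need.

```python
# Pure-Python sketch of hash partitioning (the idea behind Spark's
# HashPartitioner, not its API). A key can never span two partitions
# because its partition index is a pure function of the key.

def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5)]
parts = hash_partition(pairs, num_partitions=4)

# Every key's values are confined to a single partition:
for key in {"A", "B", "C"}:
    holders = [i for i, p in enumerate(parts) if any(k == key for k, _ in p)]
    assert len(holders) == 1
```

Note the flip side mentioned in the reply: with an upper-bound partition count, some partitions may simply come out empty.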
New to Spark - Partitioning Question
Hello, I am new to Apache Spark and this is my company's first Spark project. Essentially, we are calculating models dealing with mining data using Spark. I am holding all the source data in a persisted RDD that we will refresh periodically. When a "scenario" is passed to the Spark job (we're using Job Server) the persisted RDD is filtered to the relevant mines. For example, we may want all mines in Chile and the 1990-2015 data for each. Many of the calculations are cumulative; that is, when we apply user-input "adjustment factors" to a value, we also need the "flexed" value we calculated for that mine previously. To ensure that this works, the idea is to: 1) Filter the superset to relevant mines (done) 2) Group the subset by the unique identifier for the mine. So, a group may be all the rows for mine "A" for 1990-2015 3) I then want to ensure that the RDD is partitioned by the Mine Identifier (an Integer). It's step 3 that is confusing me. I suspect it's very easy ... do I simply use PartitionByKey? We're using Java if that makes any difference. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
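The three steps in the question can be sketched in plain Python, with a list of dicts standing in for the persisted RDD. The field names and sample rows below are made up for illustration; they are not from the original poster's data.

```python
# Sketch of steps 1-3: filter the superset, group by mine id, and bucket
# the groups so every row for one mine sits together (the property the
# cumulative calculations rely on).

from collections import defaultdict

rows = [
    {"mine": 101, "country": "Chile", "year": 1995, "value": 3.2},
    {"mine": 101, "country": "Chile", "year": 1996, "value": 3.5},
    {"mine": 202, "country": "Peru",  "year": 1995, "value": 1.1},
    {"mine": 303, "country": "Chile", "year": 2001, "value": 9.9},
]

# 1) Filter the superset to the relevant mines (here: Chile, 1990-2015)
subset = [r for r in rows
          if r["country"] == "Chile" and 1990 <= r["year"] <= 2015]

# 2) Group the subset by the unique mine identifier
groups = defaultdict(list)
for r in subset:
    groups[r["mine"]].append(r)

# 3) "Partition" so each group (all rows for one mine) stays together
partitions = [groups[k] for k in sorted(groups)]
print([len(p) for p in partitions])  # [2, 1]
```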
Re: NEW to spark and sparksql
I believe functions like sc.textFile will also accept paths with globs for example "/data/*/" which would read all the directories into a single RDD. Under the covers I think it is just using Hadoop's FileInputFormat, in case you want to google for the full list of supported syntax. On Thu, Nov 20, 2014 at 7:27 AM, Sam Flint wrote: > So you are saying to query an entire day of data I would need to create > one RDD for every hour and then union them into one RDD. After I have the > one RDD I would be able to query for a=2 throughout the entire day. > Please correct me if I am wrong. > > Thanks > > On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust > wrote: > >> I would use just textFile unless you actually need a guarantee that you >> will be seeing a whole file at time (textFile splits on new lines). >> >> RDDs are immutable, so you cannot add data to them. You can however >> union two RDDs, returning a new RDD that contains all the data. >> >> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint >> wrote: >> >>> Michael, >>> Thanks for your help. I found a wholeTextFiles() that I can use to >>> import all files in a directory. I believe this would be the case if all >>> the files existed in the same directory. Currently the files come in by >>> the hour and are in a format somewhat like this ../2014/10/01/00/filename >>> and there is one file per hour. >>> >>> Do I create an RDD and add to it? Is that possible? My example query >>> would be select count(*) from (entire day RDD) where a=2. "a" would exist >>> in all files multiple times with multiple values. >>> >>> I don't see in any documentation how to import a file create an RDD then >>> import another file into that RDD. kinda like in mysql when you create a >>> table import data then import more data. This may be my ignorance because >>> I am not that familiar with spark, but I would expect to import data into a >>> single RDD before performing analytics on it. >>> >>> Thank you for your time and help on this. 
>>> >>> >>> P.S. I am using python if that makes a difference. >>> >>> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust < >>> mich...@databricks.com> wrote: >>> >>>> In general you should be able to read full directories of files as a >>>> single RDD/SchemaRDD. For documentation I'd suggest the programming >>>> guides: >>>> >>>> http://spark.apache.org/docs/latest/quick-start.html >>>> http://spark.apache.org/docs/latest/sql-programming-guide.html >>>> >>>> For Avro in particular, I have been working on a library for Spark >>>> SQL. Its very early code, but you can find it here: >>>> https://github.com/databricks/spark-avro >>>> >>>> Bug reports welcome! >>>> >>>> Michael >>>> >>>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> I am new to spark. I have began to read to understand sparks RDD >>>>> files as well as SparkSQL. My question is more on how to build out the >>>>> RDD >>>>> files and best practices. I have data that is broken down by hour into >>>>> files on HDFS in avro format. Do I need to create a separate RDD for >>>>> each >>>>> file? or using SparkSQL a separate SchemaRDD? >>>>> >>>>> I want to be able to pull lets say an entire day of data into spark >>>>> and run some analytics on it. Then possibly a week, a month, etc. >>>>> >>>>> >>>>> If there is documentation on this procedure or best practives for >>>>> building RDD's please point me at them. >>>>> >>>>> Thanks for your time, >>>>>Sam >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> >>> *MAGNE**+**I**C* >>> >>> *Sam Flint* | *Lead Developer, Data Analytics* >>> >>> >>> >> > > > -- > > *MAGNE**+**I**C* > > *Sam Flint* | *Lead Developer, Data Analytics* > > >
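The glob behavior described above can be roughly illustrated with Python's `fnmatch` (an analogy only: `fnmatch`'s `*` also crosses `/`, and Hadoop's glob syntax has additional features, so semantics are not identical). The paths below are made up to mirror the ../YYYY/MM/DD/HH layout from this thread.

```python
# Illustration of selecting all hourly files for one day with a glob-style
# pattern, as sc.textFile("/data/2014/10/01/*") would do via Hadoop's
# FileInputFormat. fnmatch is used here only as a rough stand-in.

from fnmatch import fnmatch

hourly_paths = [f"2014/10/01/{h:02d}/part-00000.avro" for h in range(24)]
other_day    = ["2014/10/02/00/part-00000.avro"]

pattern = "2014/10/01/*/*"
matched = [p for p in hourly_paths + other_day if fnmatch(p, pattern)]
print(len(matched))  # 24 -- one file per hour of the target day
```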
Re: NEW to spark and sparksql
So you are saying that, to query an entire day of data, I would need to create one RDD for every hour and then union them into one RDD. After I have the one RDD, I would be able to query for a=2 across the entire day. Please correct me if I am wrong. Thanks On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust wrote: > I would use just textFile unless you actually need a guarantee that you > will be seeing a whole file at a time (textFile splits on new lines). > > RDDs are immutable, so you cannot add data to them. You can however union > two RDDs, returning a new RDD that contains all the data. > > On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint wrote: > >> Michael, >> Thanks for your help. I found a wholeTextFiles() that I can use to >> import all files in a directory. I believe this would be the case if all >> the files existed in the same directory. Currently the files come in by >> the hour and are in a format somewhat like this ../2014/10/01/00/filename >> and there is one file per hour. >> >> Do I create an RDD and add to it? Is that possible? My example query >> would be select count(*) from (entire day RDD) where a=2. "a" would exist >> in all files multiple times with multiple values. >> >> I don't see in any documentation how to import a file create an RDD then >> import another file into that RDD. kinda like in mysql when you create a >> table import data then import more data. This may be my ignorance because >> I am not that familiar with spark, but I would expect to import data into a >> single RDD before performing analytics on it. >> >> Thank you for your time and help on this. >> >> >> P.S. I am using python if that makes a difference. >> >> On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust > > wrote: >> >>> In general you should be able to read full directories of files as a >>> single RDD/SchemaRDD. 
For documentation I'd suggest the programming >>> guides: >>> >>> http://spark.apache.org/docs/latest/quick-start.html >>> http://spark.apache.org/docs/latest/sql-programming-guide.html >>> >>> For Avro in particular, I have been working on a library for Spark SQL. >>> Its very early code, but you can find it here: >>> https://github.com/databricks/spark-avro >>> >>> Bug reports welcome! >>> >>> Michael >>> >>> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint >>> wrote: >>> >>>> Hi, >>>> >>>> I am new to spark. I have began to read to understand sparks RDD >>>> files as well as SparkSQL. My question is more on how to build out the RDD >>>> files and best practices. I have data that is broken down by hour into >>>> files on HDFS in avro format. Do I need to create a separate RDD for each >>>> file? or using SparkSQL a separate SchemaRDD? >>>> >>>> I want to be able to pull lets say an entire day of data into spark and >>>> run some analytics on it. Then possibly a week, a month, etc. >>>> >>>> >>>> If there is documentation on this procedure or best practives for >>>> building RDD's please point me at them. >>>> >>>> Thanks for your time, >>>>Sam >>>> >>>> >>>> >>>> >>> >> >> >> -- >> >> *MAGNE**+**I**C* >> >> *Sam Flint* | *Lead Developer, Data Analytics* >> >> >> > -- *MAGNE**+**I**C* *Sam Flint* | *Lead Developer, Data Analytics*
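The approach confirmed in this exchange (one RDD per hour, unioned into a single day-level RDD, then queried) can be sketched in plain Python, with lists standing in for the per-hour RDDs. `load_hour` is a hypothetical loader, not a real API.

```python
# Sketch of building one dataset per hour and unioning them into a day.
# In Spark this would be sc.union([...]) or a chain of rdd.union(other);
# plain lists stand in for RDDs here.

def load_hour(hour):
    """Hypothetical loader: pretend each hourly file yields two records."""
    return [{"hour": hour, "a": 2}, {"hour": hour, "a": 7}]

day = []
for hour in range(24):
    day = day + load_hour(hour)  # union returns a NEW collection; like an
                                 # RDD, the originals are never mutated

# The example query from the thread: count records where a == 2
count = sum(1 for rec in day if rec["a"] == 2)
print(count)  # 24 -- one matching record per hour
```

This mirrors the immutability point above: you never add data to an existing RDD, you union two RDDs and get a third.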
Re: NEW to spark and sparksql
I would use just textFile unless you actually need a guarantee that you will be seeing a whole file at a time (textFile splits on new lines). RDDs are immutable, so you cannot add data to them. You can, however, union two RDDs, returning a new RDD that contains all the data. On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint wrote: > Michael, > Thanks for your help. I found a wholeTextFiles() that I can use to > import all files in a directory. I believe this would be the case if all > the files existed in the same directory. Currently the files come in by > the hour and are in a format somewhat like this ../2014/10/01/00/filename > and there is one file per hour. > > Do I create an RDD and add to it? Is that possible? My example query > would be select count(*) from (entire day RDD) where a=2. "a" would exist > in all files multiple times with multiple values. > > I don't see in any documentation how to import a file create an RDD then > import another file into that RDD. kinda like in mysql when you create a > table import data then import more data. This may be my ignorance because > I am not that familiar with spark, but I would expect to import data into a > single RDD before performing analytics on it. > > Thank you for your time and help on this. > > > P.S. I am using python if that makes a difference. > > On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust > wrote: > >> In general you should be able to read full directories of files as a >> single RDD/SchemaRDD. For documentation I'd suggest the programming >> guides: >> >> http://spark.apache.org/docs/latest/quick-start.html >> http://spark.apache.org/docs/latest/sql-programming-guide.html >> >> For Avro in particular, I have been working on a library for Spark SQL. >> Its very early code, but you can find it here: >> https://github.com/databricks/spark-avro >> >> Bug reports welcome! >> >> Michael >> >> On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint >> wrote: >> >>> Hi, >>> >>> I am new to spark. 
I have began to read to understand sparks RDD >>> files as well as SparkSQL. My question is more on how to build out the RDD >>> files and best practices. I have data that is broken down by hour into >>> files on HDFS in avro format. Do I need to create a separate RDD for each >>> file? or using SparkSQL a separate SchemaRDD? >>> >>> I want to be able to pull lets say an entire day of data into spark and >>> run some analytics on it. Then possibly a week, a month, etc. >>> >>> >>> If there is documentation on this procedure or best practives for >>> building RDD's please point me at them. >>> >>> Thanks for your time, >>>Sam >>> >>> >>> >>> >> > > > -- > > *MAGNE**+**I**C* > > *Sam Flint* | *Lead Developer, Data Analytics* > > >
Re: NEW to spark and sparksql
In general you should be able to read full directories of files as a single RDD/SchemaRDD. For documentation I'd suggest the programming guides: http://spark.apache.org/docs/latest/quick-start.html http://spark.apache.org/docs/latest/sql-programming-guide.html For Avro in particular, I have been working on a library for Spark SQL. It's very early code, but you can find it here: https://github.com/databricks/spark-avro Bug reports welcome! Michael On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint wrote: > Hi, > > I am new to spark. I have began to read to understand sparks RDD > files as well as SparkSQL. My question is more on how to build out the RDD > files and best practices. I have data that is broken down by hour into > files on HDFS in avro format. Do I need to create a separate RDD for each > file? or using SparkSQL a separate SchemaRDD? > > I want to be able to pull lets say an entire day of data into spark and > run some analytics on it. Then possibly a week, a month, etc. > > > If there is documentation on this procedure or best practives for building > RDD's please point me at them. > > Thanks for your time, >Sam > > > >
NEW to spark and sparksql
Hi, I am new to Spark. I have begun reading to understand Spark's RDD files as well as SparkSQL. My question is more on how to build out the RDD files and best practices. I have data that is broken down by hour into files on HDFS in Avro format. Do I need to create a separate RDD for each file? Or, using SparkSQL, a separate SchemaRDD? I want to be able to pull, let's say, an entire day of data into Spark and run some analytics on it. Then possibly a week, a month, etc. If there is documentation on this procedure or best practices for building RDDs, please point me at them. Thanks for your time, Sam