Re: df.dtypes -> pyspark.sql.types
Spark 1.5 is the latest that I have access to and where this problem happens. I don't see that it's fixed in master, but I might be wrong. Diff attached. https://raw.githubusercontent.com/apache/spark/branch-1.5/python/pyspark/sql/types.py https://raw.githubusercontent.com/apache/spark/d57daf1f7732a7ac54a91fe112deeda0a254f9ef/python/pyspark/sql/types.py -- Ruslan Dautkhanov On Wed, Mar 16, 2016 at 4:44 PM, Reynold Xin wrote: > We probably should have the alias. Is this still a problem on master > branch? > > On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov > wrote: > >> Running following: >> >> #fix schema for gaid which should not be Double >>> from pyspark.sql.types import * >>> customSchema = StructType() >>> for (col,typ) in tsp_orig.dtypes: >>> if col=='Agility_GAID': >>> typ='string' >>> customSchema.add(col,typ,True) >> >> >> Getting >> >> ValueError: Could not parse datatype: bigint >> >> >> Looks like pyspark.sql.types doesn't know anything about bigint.. >> Should it be aliased to LongType in pyspark.sql.types? >> >> Thanks >> >> >> On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov > > wrote: >> >>> Hello, >>> >>> Looking at >>> >>> https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html >>> >>> and can't wrap my head around how to convert string data types names to >>> actual >>> pyspark.sql.types data types? >>> >>> Does pyspark.sql.types has an interface to return StringType() for >>> "string", >>> IntegerType() for "integer" etc? If it doesn't exist it would be great >>> to have such a >>> mapping function. >>> >>> Thank you. >>> >>> >>> ps. I have a data frame, and use its dtypes to loop through all columns >>> to fix a few >>> columns' data types as a workaround for SPARK-13866. >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >> >> >

683a684,806
> _FIXED_DECIMAL = re.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)")
>
>
> _BRACKETS = {'(': ')', '[': ']', '{': '}'}
>
>
> def _parse_basic_datatype_string(s):
>     if s in _all_atomic_types.keys():
>         return _all_atomic_types[s]()
>     elif s == "int":
>         return IntegerType()
>     elif _FIXED_DECIMAL.match(s):
>         m = _FIXED_DECIMAL.match(s)
>         return DecimalType(int(m.group(1)), int(m.group(2)))
>     else:
>         raise ValueError("Could not parse datatype: %s" % s)
>
>
> def _ignore_brackets_split(s, separator):
>     """
>     Splits the given string by given separator, but ignore separators inside
>     brackets pairs, e.g. given "a,b" and separator ",", it will return ["a", "b"],
>     but given "a<b,c>, d", it will return ["a<b,c>", "d"].
>     """
>     parts = []
>     buf = ""
>     level = 0
>     for c in s:
>         if c in _BRACKETS.keys():
>             level += 1
>             buf += c
>         elif c in _BRACKETS.values():
>             if level == 0:
>                 raise ValueError("Brackets are not correctly paired: %s" % s)
>             level -= 1
>             buf += c
>         elif c == separator and level > 0:
>             buf += c
>         elif c == separator:
>             parts.append(buf)
>             buf = ""
>         else:
>             buf += c
>
>     if len(buf) == 0:
>         raise ValueError("The %s cannot be the last char: %s" % (separator, s))
>     parts.append(buf)
>     return parts
>
>
> def _parse_struct_fields_string(s):
>     parts = _ignore_brackets_split(s, ",")
>     fields = []
>     for part in parts:
>         name_and_type = _ignore_brackets_split(part, ":")
>         if len(name_and_type) != 2:
>             raise ValueError("The struct field string format is: 'field_name:field_type', " +
>                              "but got: %s" % part)
>         field_name = name_and_type[0].strip()
>         field_type = _parse_datatype_string(name_and_type[1])
>         fields.append(StructField(field_name, field_type))
>     return StructType(fields)
>
>
> def _parse_datatype_string(s):
>     """
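Until that parser is available, a small lookup on the 1.5 side goes a long way for simple column types. Below is a minimal sketch (illustrative, not tested against the exact data in this thread) that aliases bigint to LongType by hand and reuses the tsp_orig / Agility_GAID names from the messages above:

```python
from pyspark.sql.types import (StructType, StringType, IntegerType, LongType,
                               DoubleType, FloatType, BooleanType,
                               TimestampType, DateType)

# df.dtypes strings -> pyspark.sql.types constructors;
# "bigint" is the alias missing from the 1.5 string parser.
SIMPLE_TYPES = {
    'string': StringType, 'int': IntegerType, 'integer': IntegerType,
    'bigint': LongType, 'long': LongType, 'double': DoubleType,
    'float': FloatType, 'boolean': BooleanType,
    'timestamp': TimestampType, 'date': DateType,
}

def dtype_to_spark_type(name):
    """Map a df.dtypes string such as 'bigint' to a DataType instance."""
    try:
        return SIMPLE_TYPES[name]()
    except KeyError:
        raise ValueError("Could not map datatype: %s" % name)

# Same loop as in the thread, but passing DataType objects instead of strings.
customSchema = StructType()
for col, typ in tsp_orig.dtypes:
    if col == 'Agility_GAID':
        typ = 'string'
    customSchema.add(col, dtype_to_spark_type(typ), True)
```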
Re: df.dtypes -> pyspark.sql.types
Running the following:

#fix schema for gaid which should not be Double
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col,typ) in tsp_orig.dtypes:
>     if col=='Agility_GAID':
>         typ='string'
>     customSchema.add(col,typ,True)

Getting

ValueError: Could not parse datatype: bigint

Looks like pyspark.sql.types doesn't know anything about bigint. Should it be aliased to LongType in pyspark.sql.types? Thanks

On Wed, Mar 16, 2016 at 10:18 AM, Ruslan Dautkhanov wrote: > Hello, > > Looking at > > https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html > > and can't wrap my head around how to convert string data types names to > actual > pyspark.sql.types data types? > > Does pyspark.sql.types has an interface to return StringType() for > "string", > IntegerType() for "integer" etc? If it doesn't exist it would be great to > have such a > mapping function. > > Thank you. > > > ps. I have a data frame, and use its dtypes to loop through all columns to > fix a few > columns' data types as a workaround for SPARK-13866. > > > -- > Ruslan Dautkhanov >
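A workaround that avoids string parsing altogether is to rebuild the schema from the DataFrame's existing DataType objects and override only the offending column. A minimal sketch, assuming the same tsp_orig DataFrame and Agility_GAID column as above:

```python
from pyspark.sql.types import StructType, StructField, StringType

# df.schema already holds DataType objects, so nothing needs to be parsed;
# only the one column is swapped out for StringType.
fields = [StructField(f.name, StringType(), True) if f.name == 'Agility_GAID'
          else f
          for f in tsp_orig.schema.fields]
customSchema = StructType(fields)
```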
df.dtypes -> pyspark.sql.types
Hello, Looking at https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html and can't wrap my head around how to convert string data type names to actual pyspark.sql.types data types. Does pyspark.sql.types have an interface that returns StringType() for "string", IntegerType() for "integer", etc.? If it doesn't exist, it would be great to have such a mapping function. Thank you. ps. I have a data frame, and use its dtypes to loop through all columns to fix a few columns' data types as a workaround for SPARK-13866. -- Ruslan Dautkhanov
Spark session dies in about 2 days: HDFS_DELEGATION_TOKEN token can't be found
Spark session dies out after ~40 hours when running against a secure Hadoop cluster. spark-submit has --principal and --keytab, so kerberos ticket renewal works fine according to the logs. Something happens with the HDFS connection? These messages come up every second (see the complete stack: http://pastebin.com/QxcQvpqm):

16/03/11 16:04:59 WARN hdfs.LeaseRenewer: Failed to renew lease for
> [DFSClient_NONMAPREDUCE_1534318438_13] for 2802 seconds. Will retry shortly ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in cache

Then in 1 hour it stops trying:

16/03/11 16:18:17 WARN hdfs.DFSClient: Failed to renew lease for
> DFSClient_NONMAPREDUCE_1534318438_13 for 3600 seconds (>= hard-limit =3600 seconds.)
> Closing all files being written ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in cache

It doesn't look like a Kerberos principal ticket renewal problem, because that would expire much sooner (by default we have 12 hours), and from the logs the Spark kerberos ticket renewer works fine. Is it some other HDFS delegation token renewal process that breaks?

RHEL 6.7
> Spark 1.5
> Hadoop 2.6

Found HDFS-5322, YARN-2648 that seem relevant, but I am not sure if it's the same problem. It seems to be a Spark problem, as I have only seen it in Spark. This is a reproducible problem: just wait for ~40 hours and the Spark session is no good. Thanks, Ruslan
binary file deserialization
We have a huge binary file in a custom serialization format (e.g. a header tells the length of the record, then there is a varying number of items for that record). This is produced by an old C++ application. What would be the best approach to deserialize it into a Hive table or a Spark RDD? The format is known and well documented. -- Ruslan Dautkhanov
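One common approach is to read the raw bytes with sc.binaryFiles and parse records with a generator. The sketch below assumes a made-up layout (a 4-byte little-endian length header followed by the payload) and an illustrative path; the real header format from the C++ application would go in parse_records. It only scales as far as each individual file fits in an executor's memory; for a single enormous file, a custom Hadoop InputFormat read via sc.newAPIHadoopFile is the more robust route.

```python
import struct

def parse_records(path_and_bytes):
    """Yield one payload per record from a whole-file byte string."""
    _, data = path_and_bytes
    pos, size = 0, len(data)
    while pos + 4 <= size:
        (rec_len,) = struct.unpack_from('<i', data, pos)  # hypothetical header layout
        pos += 4
        yield data[pos:pos + rec_len]
        pos += rec_len

# sc.binaryFiles yields (path, contents) pairs, one per file
records = sc.binaryFiles("/data/legacy_binary/*.bin").flatMap(parse_records)
```

From there the parsed records can be mapped to Rows and saved/registered as a Hive table with sqlContext.createDataFrame(...).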
Re: Spark + Sentry + Kerberos don't add up?
Turns to be it is a Spark issue https://issues.apache.org/jira/browse/SPARK-13478 -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov wrote: > Hi Romain, > > Thank you for your response. > > Adding Kerberos support might be as simple as > https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal > and --keytab parameters to be passed to spark-submit. > > As a workaround I just did kinit (using hues' keytab) and then launched > Livy Server. It probably will work as long as kerberos ticket doesn't > expire. That's it would be great to have support for --principal and > --keytab parameters for spark-submit as explined in > http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html > > > The only problem I have currently is the above error stack in my previous > email: > > The Spark session could not be created in the cluster: >> at org.apache.hadoop.security.*UserGroupInformation.doAs*( >> UserGroupInformation.java:1671) >> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1( >> SparkSubmit.scala:160) > > > > >> AFAIK Hive impersonation should be turned off when using Sentry > > Yep, exactly. That's what I did. It is disabled now. But looks like on > other hand, Spark or Spark Notebook want to have that enabled? > It tries to do org.apache.hadoop.security.UserGroupInformation.doAs() > hence the error. > > So Sentry isn't compatible with Spark in kerberized clusters? Is any > workaround for this problem? > > > -- > Ruslan Dautkhanov > > On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux > wrote: > >> Livy does not support any Kerberos yet >> https://issues.cloudera.org/browse/LIVY-3 >> >> Are you focusing instead about HS2 + Kerberos with Sentry? >> >> AFAIK Hive impersonation should be turned off when using Sentry: >> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html >> >> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > > wrote: >> >>> Getting following error stack >>> >>> The Spark session could not be created in the cluster: >>>> at org.apache.hadoop.security.*UserGroupInformation.doAs* >>>> (UserGroupInformation.java:1671) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ) >>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient* >>>> .open(HiveMetaStoreClient.java:466) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) >>>> at >>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>>> ... 35 more >>> >>> >>> My understanding that hive.server2.enable.impersonation and >>> hive.server2.enable.doAs should be enabled to make >>> UserGroupInformation.doAs() work? >>> >>> When I try to enable these parameters, Cloudera Manager shows error >>> >>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 >>>> (hostname)'. >>>> Hive Impersonation should be disabled to enable Hive authorization >>>> using Sentry >>> >>> >>> So Spark-Hive conflicts with Sentry!? >>> >>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). >>> CDH 5.5. >>> >>> This is a kerberized cluster with Sentry. 
>>> >>> I was using hue's keytab as hue user is normally (by default in CDH) is >>> allowed to impersonate to other users. >>> So very convenient for Spark Notebooks. >>> >>> Any information to help solve this will be highly appreciated. >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Hue-Users" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to hue-user+unsubscr...@cloudera.org. >>> >> >> >
spark.storage.memoryFraction for shuffle-only jobs
For a Spark job that only does shuffling (e.g. Spark SQL with joins, group bys, analytical functions, order bys), but has no explicitly persisted RDDs or DataFrames (there are no .cache() calls in the code), what would be the lowest recommended setting for spark.storage.memoryFraction? spark.storage.memoryFraction defaults to 0.6, which is quite large for shuffle-only jobs. spark.shuffle.memoryFraction defaults to 0.2 in Spark 1.5.0. Can I set spark.storage.memoryFraction to something low like 0.1 or even lower, and spark.shuffle.memoryFraction to something large like 0.9, or perhaps even 1.0? Thanks!
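For context, these two knobs belong to Spark 1.5's static memory model and are set as ordinary Spark conf properties. The values below only illustrate shifting memory from the (unused) storage pool to the shuffle pool; they are not a tuning recommendation:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-heavy-job")
        # shrink the cache pool since nothing is persisted ...
        .set("spark.storage.memoryFraction", "0.1")
        # ... and give the shuffle pool most of the heap
        .set("spark.shuffle.memoryFraction", "0.7"))
sc = SparkContext(conf=conf)
```

Note that the two fractions (plus their safety factors) share the same heap with user code and task objects, so pushing their sum toward 1.0 leaves little room for everything else.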
Re: Hive on Spark knobs
Yep, I tried that. It seems you're right. Got an error that execution engine has to be set to mr. hive.execution.engine = mr I did not keep exact error message/stack. It's probably disabled explicitly. -- Ruslan Dautkhanov On Thu, Jan 28, 2016 at 7:03 AM, Todd wrote: > Did you run hive on spark with spark 1.5 and hive 1.1? > I think hive on spark doesn't support spark 1.5. There are compatibility > issues. > > > At 2016-01-28 01:51:43, "Ruslan Dautkhanov" wrote: > > > https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started > > There are quite a lot of knobs to tune for Hive on Spark. > > Above page recommends following settings: > > mapreduce.input.fileinputformat.split.maxsize=75000 >> hive.vectorized.execution.enabled=true >> hive.cbo.enable=true >> hive.optimize.reducededuplication.min.reducer=4 >> hive.optimize.reducededuplication=true >> hive.orc.splits.include.file.footer=false >> hive.merge.mapfiles=true >> hive.merge.sparkfiles=false >> hive.merge.smallfiles.avgsize=1600 >> hive.merge.size.per.task=25600 >> hive.merge.orcfile.stripe.level=true >> hive.auto.convert.join=true >> hive.auto.convert.join.noconditionaltask=true >> hive.auto.convert.join.noconditionaltask.size=894435328 >> hive.optimize.bucketmapjoin.sortedmerge=false >> hive.map.aggr.hash.percentmemory=0.5 >> hive.map.aggr=true >> hive.optimize.sort.dynamic.partition=false >> hive.stats.autogather=true >> hive.stats.fetch.column.stats=true >> hive.vectorized.execution.reduce.enabled=false >> hive.vectorized.groupby.checkinterval=4096 >> hive.vectorized.groupby.flush.percent=0.1 >> hive.compute.query.using.stats=true >> hive.limit.pushdown.memory.usage=0.4 >> hive.optimize.index.filter=true >> hive.exec.reducers.bytes.per.reducer=67108864 >> hive.smbjoin.cache.rows=1 >> hive.exec.orc.default.stripe.size=67108864 >> hive.fetch.task.conversion=more >> hive.fetch.task.conversion.threshold=1073741824 >> hive.fetch.task.aggr=false >> mapreduce.input.fileinputformat.list-status.num-threads=5 >> spark.kryo.referenceTracking=false >> >> spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch > > > Did it work for everybody? It may take days if not weeks to try to tune > all of these parameters for a specific job. > > We're on Spark 1.5 / Hive 1.1. > > > ps. We have a job that can't get working well as a Hive job, so thought to > use Hive on Spark instead. (a 3-table full outer joins with group by + > collect_list). Spark should handle this much better. > > > Ruslan > > >
Hive on Spark knobs
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started There are quite a lot of knobs to tune for Hive on Spark. Above page recommends following settings: mapreduce.input.fileinputformat.split.maxsize=75000 > hive.vectorized.execution.enabled=true > hive.cbo.enable=true > hive.optimize.reducededuplication.min.reducer=4 > hive.optimize.reducededuplication=true > hive.orc.splits.include.file.footer=false > hive.merge.mapfiles=true > hive.merge.sparkfiles=false > hive.merge.smallfiles.avgsize=1600 > hive.merge.size.per.task=25600 > hive.merge.orcfile.stripe.level=true > hive.auto.convert.join=true > hive.auto.convert.join.noconditionaltask=true > hive.auto.convert.join.noconditionaltask.size=894435328 > hive.optimize.bucketmapjoin.sortedmerge=false > hive.map.aggr.hash.percentmemory=0.5 > hive.map.aggr=true > hive.optimize.sort.dynamic.partition=false > hive.stats.autogather=true > hive.stats.fetch.column.stats=true > hive.vectorized.execution.reduce.enabled=false > hive.vectorized.groupby.checkinterval=4096 > hive.vectorized.groupby.flush.percent=0.1 > hive.compute.query.using.stats=true > hive.limit.pushdown.memory.usage=0.4 > hive.optimize.index.filter=true > hive.exec.reducers.bytes.per.reducer=67108864 > hive.smbjoin.cache.rows=1 > hive.exec.orc.default.stripe.size=67108864 > hive.fetch.task.conversion=more > hive.fetch.task.conversion.threshold=1073741824 > hive.fetch.task.aggr=false > mapreduce.input.fileinputformat.list-status.num-threads=5 > spark.kryo.referenceTracking=false > > spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch Did it work for everybody? It may take days if not weeks to try to tune all of these parameters for a specific job. We're on Spark 1.5 / Hive 1.1. ps. We have a job that can't get working well as a Hive job, so thought to use Hive on Spark instead. (a 3-table full outer joins with group by + collect_list). Spark should handle this much better. Ruslan
Re: Spark + Sentry + Kerberos don't add up?
I took liberty and created a JIRA https://github.com/cloudera/livy/issues/36 Feel free to close it if doesn't belong to Livy project. I really don't know if this is a Spark or a Livy/Sentry problem. Any ideas for possible workarounds? Thank you. -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov wrote: > Hi Romain, > > Thank you for your response. > > Adding Kerberos support might be as simple as > https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal > and --keytab parameters to be passed to spark-submit. > > As a workaround I just did kinit (using hues' keytab) and then launched > Livy Server. It probably will work as long as kerberos ticket doesn't > expire. That's it would be great to have support for --principal and > --keytab parameters for spark-submit as explined in > http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html > > > The only problem I have currently is the above error stack in my previous > email: > > The Spark session could not be created in the cluster: >> at org.apache.hadoop.security.*UserGroupInformation.doAs*( >> UserGroupInformation.java:1671) >> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1( >> SparkSubmit.scala:160) > > > > >> AFAIK Hive impersonation should be turned off when using Sentry > > Yep, exactly. That's what I did. It is disabled now. But looks like on > other hand, Spark or Spark Notebook want to have that enabled? > It tries to do org.apache.hadoop.security.UserGroupInformation.doAs() > hence the error. > > So Sentry isn't compatible with Spark in kerberized clusters? Is any > workaround for this problem? > > > -- > Ruslan Dautkhanov > > On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux > wrote: > >> Livy does not support any Kerberos yet >> https://issues.cloudera.org/browse/LIVY-3 >> >> Are you focusing instead about HS2 + Kerberos with Sentry? >> >> AFAIK Hive impersonation should be turned off when using Sentry: >> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html >> >> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > > wrote: >> >>> Getting following error stack >>> >>> The Spark session could not be created in the cluster: >>>> at org.apache.hadoop.security.*UserGroupInformation.doAs* >>>> (UserGroupInformation.java:1671) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) >>>> at >>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ) >>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient* >>>> .open(HiveMetaStoreClient.java:466) >>>> at >>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) >>>> at >>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>>> ... 35 more >>> >>> >>> My understanding that hive.server2.enable.impersonation and >>> hive.server2.enable.doAs should be enabled to make >>> UserGroupInformation.doAs() work? >>> >>> When I try to enable these parameters, Cloudera Manager shows error >>> >>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 >>>> (hostname)'. >>>> Hive Impersonation should be disabled to enable Hive authorization >>>> using Sentry >>> >>> >>> So Spark-Hive conflicts with Sentry!? 
>>> >>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). >>> CDH 5.5. >>> >>> This is a kerberized cluster with Sentry. >>> >>> I was using hue's keytab as hue user is normally (by default in CDH) is >>> allowed to impersonate to other users. >>> So very convenient for Spark Notebooks. >>> >>> Any information to help solve this will be highly appreciated. >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Hue-Users" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to hue-user+unsubscr...@cloudera.org. >>> >> >> >
Re: Spark + Sentry + Kerberos don't add up?
Hi Romain, Thank you for your response.

Adding Kerberos support might be as simple as https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal and --keytab parameters to be passed to spark-submit.

As a workaround I just did kinit (using hue's keytab) and then launched Livy Server. It probably will work as long as the kerberos ticket doesn't expire. That's why it would be great to have support for --principal and --keytab parameters for spark-submit, as explained in http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html

The only problem I have currently is the above error stack in my previous email:

The Spark session could not be created in the cluster: > at org.apache.hadoop.security.*UserGroupInformation.doAs*( > UserGroupInformation.java:1671) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1( > SparkSubmit.scala:160)

>> AFAIK Hive impersonation should be turned off when using Sentry

Yep, exactly. That's what I did. It is disabled now. But it looks like, on the other hand, Spark or Spark Notebook wants to have that enabled? It tries to do org.apache.hadoop.security.UserGroupInformation.doAs(), hence the error. So Sentry isn't compatible with Spark in kerberized clusters? Is there any workaround for this problem? -- Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux wrote: > Livy does not support any Kerberos yet > https://issues.cloudera.org/browse/LIVY-3 > > Are you focusing instead about HS2 + Kerberos with Sentry? > > AFAIK Hive impersonation should be turned off when using Sentry: > http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html > > On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov > wrote: > >> Getting following error stack >> >> The Spark session could not be created in the cluster: >>> at org.apache.hadoop.security.*UserGroupInformation.doAs* >>> (UserGroupInformation.java:1671) >>> at >>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) >>> at >>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) >>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) >>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ) >>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient* >>> .open(HiveMetaStoreClient.java:466) >>> at >>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234) >>> at >>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) >>> ... 35 more >> >> >> My understanding that hive.server2.enable.impersonation and >> hive.server2.enable.doAs should be enabled to make >> UserGroupInformation.doAs() work? >> >> When I try to enable these parameters, Cloudera Manager shows error >> >> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 >>> (hostname)'. >>> Hive Impersonation should be disabled to enable Hive authorization using >>> Sentry >> >> >> So Spark-Hive conflicts with Sentry!? >> >> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). >> CDH 5.5. >> >> This is a kerberized cluster with Sentry. >> >> I was using hue's keytab as hue user is normally (by default in CDH) is >> allowed to impersonate to other users. >> So very convenient for Spark Notebooks. >> >> Any information to help solve this will be highly appreciated. >> >> >> -- >> Ruslan Dautkhanov >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Hue-Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an >> email to hue-user+unsubscr...@cloudera.org. >> > >
Spark + Sentry + Kerberos don't add up?
Getting the following error stack:

The Spark session could not be created in the cluster:
> at org.apache.hadoop.security.*UserGroupInformation.doAs*(UserGroupInformation.java:1671)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*.open(HiveMetaStoreClient.java:466)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
> at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> ... 35 more

My understanding is that hive.server2.enable.impersonation and hive.server2.enable.doAs should be enabled to make UserGroupInformation.doAs() work? When I try to enable these parameters, Cloudera Manager shows this error:

Hive Impersonation is enabled for Hive Server2 role 'HiveServer2 (hostname)'.
> Hive Impersonation should be disabled to enable Hive authorization using Sentry

So Spark-Hive conflicts with Sentry!?

Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master). CDH 5.5. This is a kerberized cluster with Sentry.

I was using hue's keytab, as the hue user is normally (by default in CDH) allowed to impersonate other users. So it is very convenient for Spark Notebooks.

Any information to help solve this will be highly appreciated. -- Ruslan Dautkhanov
livy test problem: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project livy-spark_2.10: There are test failures
Livy build test from master fails with below problem. Can't track it down. YARN shows Livy Spark yarn application as running. Although attempt to connect to application master shows connection refused: HTTP ERROR 500 > Problem accessing /proxy/application_1448640910222_0046/. Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) Not sure if Livy server has application master UI? CDH 5.5.1. Below mvn test output footer: > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] livy-main .. SUCCESS [ > 1.299 s] > [INFO] livy-api_2.10 .. SUCCESS [ > 3.622 s] > [INFO] livy-client-common_2.10 SUCCESS [ > 0.862 s] > [INFO] livy-client-local_2.10 . SUCCESS [ > 23.866 s] > [INFO] livy-core_2.10 . SUCCESS [ > 0.316 s] > [INFO] livy-repl_2.10 . SUCCESS [01:00 > min] > [INFO] livy-yarn_2.10 . SUCCESS [ > 0.215 s] > [INFO] livy-spark_2.10 FAILURE [ > 17.382 s] > [INFO] livy-server_2.10 ... SKIPPED > [INFO] livy-assembly_2.10 . SKIPPED > [INFO] livy-client-http_2.10 .. SKIPPED > [INFO] > > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 01:48 min > [INFO] Finished at: 2016-01-14T14:34:28-07:00 > [INFO] Final Memory: 27M/453M > [INFO] > > [ERROR] Failed to execute goal > org.scalatest:scalatest-maven-plugin:1.0:test (test) on project > livy-spark_2.10: There are test failures -> [Help 1] > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute > goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project > livy-spark_2.10: There are test failures > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80) > at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51) > at > org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193) > at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106) > at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863) > at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288) > at org.apache.maven.cli.MavenCli.main(MavenCli.java:199) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) > at > org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) > at > 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) > at > org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) > Caused by: org.apache.maven.plugin.MojoFailureException: There are test > failures > at org.scalatest.tools.maven.TestMojo.execute(TestMojo.java:107) > at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207) > ... 20 more > [ERROR] > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
Re: Spark on hbase using Phoenix in secure cluster
Try Phoenix from Cloudera parcel distribution https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/ They may have better Kerberos support .. On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Yes, its a kerberized cluster and ticket was generated using kinit command > before running spark job. That's why Spark on hbase worked but when phoenix > is used to get the connection to hbase, it does not pass the authentication > to all nodes. Probably it is not handled in Phoenix version 4.3 or Spark > 1.3.1 does not provide integration with Phoenix for kerberized cluster. > > Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured > cluster or not? > > Thanks, > Akhilesh > > On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov > wrote: > >> That error is not directly related to spark nor hbase >> >> javax.security.sasl.SaslException: GSS initiate failed [Caused by >> GSSException: No valid credentials provided (Mechanism level: Failed to >> find any Kerberos tgt)] >> >> Is this a kerberized cluster? You likely don't have a good (non-expired) >> kerberos ticket for authentication to pass. >> >> >> -- >> Ruslan Dautkhanov >> >> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia < >> pathodia.akhil...@gmail.com> wrote: >> >>> Hi, >>> >>> I am running spark job on yarn in cluster mode in secured cluster. I am >>> trying to run Spark on Hbase using Phoenix, but Spark executors are >>> unable to get hbase connection using phoenix. I am running knit command to >>> get the ticket before starting the job and also keytab file and principal >>> are correctly specified in connection URL. But still spark job on each node >>> throws below error: >>> >>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication >>> failed. The most likely cause is missing or invalid credentials. Consider >>> 'kinit'. >>> javax.security.sasl.SaslException: GSS initiate failed [Caused by >>> GSSException: No valid credentials provided (Mechanism level: Failed to >>> find any Kerberos tgt)] >>> at >>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) >>> >>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark >>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in >>> this link: >>> >>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos >>> >>> Also, I found that there is a known issue for yarn-cluster mode for >>> Spark 1.3.1 version: >>> >>> https://issues.apache.org/jira/browse/SPARK-6918 >>> >>> Has anybody been successful in running Spark on hbase using Phoenix in >>> yarn cluster or client mode? >>> >>> Thanks, >>> Akhilesh Pathodia >>> >> >> >
Re: Spark on hbase using Phoenix in secure cluster
That error is not directly related to spark nor hbase javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Is this a kerberized cluster? You likely don't have a good (non-expired) kerberos ticket for authentication to pass. -- Ruslan Dautkhanov On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Hi, > > I am running spark job on yarn in cluster mode in secured cluster. I am > trying to run Spark on Hbase using Phoenix, but Spark executors are > unable to get hbase connection using phoenix. I am running knit command to > get the ticket before starting the job and also keytab file and principal > are correctly specified in connection URL. But still spark job on each node > throws below error: > > 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed. > The most likely cause is missing or invalid credentials. Consider 'kinit'. > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to > find any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) > > I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark > on Hbase(without phoenix) successfully in yarn-client mode as mentioned in > this link: > > https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos > > Also, I found that there is a known issue for yarn-cluster mode for Spark > 1.3.1 version: > > https://issues.apache.org/jira/browse/SPARK-6918 > > Has anybody been successful in running Spark on hbase using Phoenix in > yarn cluster or client mode? > > Thanks, > Akhilesh Pathodia >
Re: SparkSQL AVRO
How many reducers you had that created those avro files? Each reducer very likely creates its own avro part- file. We normally use Parquet, but it should be the same for Avro, so this might be relevant http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289 -- Ruslan Dautkhanov On Mon, Dec 7, 2015 at 11:27 AM, Test One wrote: > I'm using spark-avro with SparkSQL to process and output avro files. My > data has the following schema: > > root > |-- memberUuid: string (nullable = true) > |-- communityUuid: string (nullable = true) > |-- email: string (nullable = true) > |-- firstName: string (nullable = true) > |-- lastName: string (nullable = true) > |-- username: string (nullable = true) > |-- profiles: map (nullable = true) > ||-- key: string > ||-- value: string (valueContainsNull = true) > > > When I write the file output as such with: > originalDF.write.avro("masterNew.avro") > > The output location is a folder with masterNew.avro and with many many > files like these: > -rw-r--r-- 1 kcsham access_bpf 8 Dec 2 11:37 ._SUCCESS.crc > -rw-r--r-- 1 kcsham access_bpf44 Dec 2 11:37 > .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc > -rw-r--r-- 1 kcsham access_bpf44 Dec 2 11:37 > .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc > -rw-r--r-- 1 kcsham access_bpf44 Dec 2 11:37 > .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc > -rw-r--r-- 1 kcsham access_bpf 0 Dec 2 11:37 _SUCCESS > -rw-r--r-- 1 kcsham access_bpf 4261 Dec 2 11:37 > part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro > -rw-r--r-- 1 kcsham access_bpf 4261 Dec 2 11:37 > part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro > -rw-r--r-- 1 kcsham access_bpf 4261 Dec 2 11:37 > part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro > > > Where there are ~10 record, it has ~28000 files in that folder. When I > simply want to copy the same dataset to a new location as an exercise from > a local master, it takes long long time and having errors like such as > well. > > 22:01:44.247 [Executor task launch worker-21] WARN > org.apache.spark.storage.MemoryStore - Not enough space to cache > rdd_112058_10705 in memory! (computed 496.0 B so far) > 22:01:44.247 [Executor task launch worker-21] WARN > org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to > disk instead. > [Stage 0:===> (10706 + 1) / > 28014]22:01:44.574 [Executor task launch worker-21] WARN > org.apache.spark.storage.MemoryStore - Failed to reserve initial memory > threshold of 1024.0 KB for computing block rdd_112058_10706 in memory. > > > I'm attributing that there are way too many files to manipulate. The > questions: > > 1. Is this the expected format of the avro file written by spark-avro? and > each 'part-' is not more than 4k? > 2. My use case is to append new records to the existing dataset using: > originalDF.unionAll(stageDF).write.avro(masterNew) > Any sqlconf, sparkconf that I should set to allow this to work? > > > Thanks, > kc > > >
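If the goal is simply fewer, larger Avro files per write, coalescing right before the write is the usual fix. A rough PySpark sketch (the partition count of 16 is arbitrary, and spark-avro is addressed through its format name); note that coalesce also lowers the parallelism of the final write stage:

```python
# shrink the number of partitions (and hence part- files) just before writing
(originalDF
    .coalesce(16)
    .write
    .format("com.databricks.spark.avro")
    .save("masterNew.avro"))
```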
Re: question about combining small parquet files
An interesting compaction approach of small files is discussed recently http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ AFAIK Spark supports views too. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi < nyigitb...@netflix.com.invalid> wrote: > Hi Spark people, > I have a Hive table that has a lot of small parquet files and I am > creating a data frame out of it to do some processing, but since I have a > large number of splits/files my job creates a lot of tasks, which I don't > want. Basically what I want is the same functionality that Hive provides, > that is, to combine these small input splits into larger ones by specifying > a max split size setting. Is this currently possible with Spark? > > I look at coalesce() but with coalesce I can only control the number > of output files not their sizes. And since the total input dataset size > can vary significantly in my case, I cannot just use a fixed partition > count as the size of each output file can get very large. I then looked for > getting the total input size from an rdd to come up with some heuristic to > set the partition count, but I couldn't find any ways to do it (without > modifying the spark source). > > Any help is appreciated. > > Thanks, > Nezih > > PS: this email is the same as my previous email as I learned that my > previous email ended up as spam for many people since I sent it through > nabble, sorry for the double post. >
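Along the same lines, a periodic compaction pass can be done in Spark itself by reading the small-file table and rewriting it with fewer partitions; the paths and partition count below are only placeholders:

```python
# Sketch of an offline compaction job: read the fragmented table and write it
# back out as fewer, larger files (ideally sized near the HDFS block size).
df = sqlContext.read.parquet("/warehouse/events_small_files")
df.coalesce(64).write.parquet("/warehouse/events_compacted")
```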
Re: Spark REST Job server feedback?
Very good question. From http://gethue.com/new-notebook-application-for-spark-sql/ "Use Livy Spark Job Server from the Hue master repository instead of CDH (it is currently much more advanced): see build & start the latest Livy <https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server> " Although that post is from April 2015, not sure if it's still accurate. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 12:04 AM, Deenar Toraskar wrote: > Hi > > I had the same question. Anyone having used Livy and/opr SparkJobServer, > would like their input. > > Regards > Deenar > > > *Think Reactive Ltd* > deenar.toras...@thinkreactive.co.uk > 07714140812 > > > > > On 8 October 2015 at 20:39, Tim Smith wrote: > >> I am curious too - any comparison between the two. Looks like one is >> Datastax sponsored and the other is Cloudera. Other than that, any >> major/core differences in design/approach? >> >> Thanks, >> >> Tim >> >> >> On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal < >> ramirezquetza...@gmail.com> wrote: >> >>> Anyone has feedback on using Hue / Spark Job Server REST servers? >>> >>> >>> http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/ >>> >>> https://github.com/spark-jobserver/spark-jobserver >>> >>> Many thanks, >>> >>> Rami >>> >> >> >> >> -- >> >> -- >> Thanks, >> >> Tim >> > >
Re: Data in one partition after reduceByKey
public long getTime() Returns the number of milliseconds since January 1, 1970, 00:00:00 GMT represented by this Date object. http://docs.oracle.com/javase/7/docs/api/java/util/Date.html#getTime%28%29 Based on what you did i might be easier to get date partitioner from that. Also, to get even more even distriubution you could use a hash function from that not just a remainder. -- Ruslan Dautkhanov On Mon, Nov 23, 2015 at 6:35 AM, Patrick McGloin wrote: > I will answer my own question, since I figured it out. Here is my answer > in case anyone else has the same issue. > > My DateTimes were all without seconds and milliseconds since I wanted to > group data belonging to the same minute. The hashCode() for Joda DateTimes > which are one minute apart is a constant: > > scala> val now = DateTime.now > now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z > > scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - > now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode > res42: Int = 6 > > As can be seen by this example, if the hashCode values are similarly > spaced, they can end up in the same partition: > > scala> val nums = for(i <- 0 to 100) yield ((i*20 % 1000), i) > nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), > (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), > (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), > (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), > (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), > (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), > (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), > (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), > (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), > (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), > (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500... 
> > scala> val rddNum = sc.parallelize(nums) > rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at > parallelize at :23 > > scala> val reducedNum = rddNum.reduceByKey(_+_) > reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at > reduceByKey at :25 > > scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, > true).collect.toList > > res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0) > > To distribute my data more evenly across the partitions I created my own > custom Partitoiner: > > class JodaPartitioner(rddNumPartitions: Int) extends Partitioner { > def numPartitions: Int = rddNumPartitions > def getPartition(key: Any): Int = { > key match { > case dateTime: DateTime => > val sum = dateTime.getYear + dateTime.getMonthOfYear + > dateTime.getDayOfMonth + dateTime.getMinuteOfDay + dateTime.getSecondOfDay > sum % numPartitions > case _ => 0 > } > } > } > > > On 20 November 2015 at 17:17, Patrick McGloin > wrote: > >> Hi, >> >> I have Spark application which contains the following segment: >> >> val reparitioned = rdd.repartition(16) >> val filtered: RDD[(MyKey, myData)] = MyUtils.filter(reparitioned, startDate, >> endDate) >> val mapped: RDD[(DateTime, myData)] = filtered.map(kv=(kv._1.processingTime, >> kv._2)) >> val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_) >> >> When I run this with some logging this is what I see: >> >> reparitioned ==> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, >> 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)] >> filtered ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, >> 2066, 2032, 2001, 2031, 2101, 2050, 2068)] >> mapped ==> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, >> 2066, 2032, 2001, 2031, 2101, 2050, 2068)] >> reduced ==> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)] >> >> My logging is done using these two lines: >> >> val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, >> true)log.info(s"rdd ==> [${sizes.collect.toList}]") >> >> My question is why does my data end up in one partition after the >> reduceByKey? After the filter it can be seen that the data is evenly >> distributed, but the reduceByKey results in data in only one partition. >> >> Thanks, >> >> Patrick >> > >
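A PySpark sketch of the suggestion above (derive the shuffle key, and therefore its hash, from the timestamp itself): here mapped stands in for an RDD of (datetime, value) pairs as in the thread, and operator.add stands in for the real combine function.

```python
from operator import add

# Key by a minute-truncated string: its hash spreads well across partitions,
# unlike near-identical objects whose hash codes differ by a small constant.
keyed = mapped.map(lambda kv: (kv[0].strftime("%Y-%m-%d %H:%M"), kv[1]))
reduced = keyed.reduceByKey(add, numPartitions=16)
```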
Re: ISDATE Function
You could write your own UDF isdate(). -- Ruslan Dautkhanov On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani wrote: > Hi Ted Yu, > > Thanks for your response. Is any other way to achieve in Spark Query? > > > Regards, > Ravi > > On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu wrote: > >> ISDATE() is currently not supported. >> Since it is SQL Server specific, I guess it wouldn't be added to Spark. >> >> On Mon, Nov 16, 2015 at 10:46 PM, Ravisankar Mani >> wrote: >> >>> Hi Everyone, >>> >>> >>> In MSSQL server suppprt "ISDATE()" function is used to fine current >>> column values date or not?. Is any possible to achieve current column >>> values date or not? >>> >>> >>> Regards, >>> Ravi >>> >> >> >
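In PySpark such a UDF is only a few lines; the accepted format below is just an example (extend the try/except to whatever formats should count as a "date"), and sqlContext is assumed to be in scope:

```python
from datetime import datetime
from pyspark.sql.types import BooleanType

def is_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

# register for use from SQL, e.g. SELECT isdate(order_date) FROM orders
sqlContext.registerFunction("isdate", is_date, BooleanType())
```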
Re: kerberos question
You could probably instead of specifying --principal [principle] --keytab [keytab] --proxy-user [proxied_user] ... arguments just create/renew a kerberos ticket before submitting a job $ kinit prinipal.name -kt keytab.file $ spark-submit ... Do you need impersonation / proxy user at all? I thought it's primary use is for Hue and similar services which uses impersonation quite heavily in kerberized cluster. -- Ruslan Dautkhanov On Wed, Nov 4, 2015 at 1:40 PM, Ted Yu wrote: > 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] > hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - > Could not find uri with key [dfs. encryption.key.provider.uri] to > create a keyProvider !! > > Could it be related to HDFS-7931 ? > > On Wed, Nov 4, 2015 at 12:30 PM, Chen Song wrote: > >> After a bit more investigation, I found that it could be related to >> impersonation on kerberized cluster. >> >> Our job is started with the following command. >> >> /usr/lib/spark/bin/spark-submit --master yarn-client --principal [principle] >> --keytab [keytab] --proxy-user [proxied_user] ... >> >> >> In application master's log, >> >> At start up, >> >> 2015-11-03 16:03:41,602 INFO [main] yarn.AMDelegationTokenRenewer >> (Logging.scala:logInfo(59)) - Scheduling login from keytab in 64789744 >> millis. >> >> Later on, when the delegation token renewer thread kicks in, it tries to >> re-login with the specified principle with new credentials and tries to >> write the new credentials into the over to the directory where the current >> user's credentials are stored. However, with impersonation, because the >> current user is a different user from the principle user, it fails with >> permission error. >> >> 2015-11-04 10:03:31,366 INFO [Delegation Token Refresh Thread-0] >> yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Attempting to >> login to KDC using principal: principal/host@domain >> 2015-11-04 10:03:31,665 INFO [Delegation Token Refresh Thread-0] >> yarn.AMDelegationTokenRenewer (Logging.scala:logInfo(59)) - Successfully >> logged into KDC. >> 2015-11-04 10:03:31,702 INFO [Delegation Token Refresh Thread-0] >> yarn.YarnSparkHadoopUtil (Logging.scala:logInfo(59)) - getting token for >> namenode: >> hdfs://hadoop_abc/user/proxied_user/.sparkStaging/application_1443481003186_0 >> 2015-11-04 10:03:31,904 INFO [Delegation Token Refresh Thread-0] >> hdfs.DFSClient (DFSClient.java:getDelegationToken(1025)) - Created >> HDFS_DELEGATION_TOKEN token 389283 for principal on ha-hdfs:hadoop_abc >> 2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] >> hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - >> Could not find uri with key [dfs.encryption.key.provider.uri] to create a >> keyProvider !! 
>> 2015-11-04 10:03:31,944 WARN [Delegation Token Refresh Thread-0] >> security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - >> PriviledgedActionException as:proxy-user (auth:SIMPLE) >> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> 2015-11-04 10:03:31,945 WARN [Delegation Token Refresh Thread-0] ipc.Client >> (Client.java:run(675)) - Exception encountered while connecting to the >> server : >> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> 2015-11-04 10:03:31,945 WARN [Delegation Token Refresh Thread-0] >> security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - >> PriviledgedActionException as:proxy-user (auth:SIMPLE) >> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> 2015-11-04 10:03:31,963 WARN [Delegation Token Refresh Thread-0] >> yarn.YarnSparkHadoopUtil (Logging.scala:logWarning(92)) - Error while >> attempting to list files from application staging dir >> org.apache.hadoop.security.AccessControlException: Permission denied: >> user=principal, access=READ_EXECUTE, >> inode="/user/proxy-user/.sparkStaging/application_1443481003186_0":proxy-user:proxy-user:drwx-- >> >> >> Can someone confirm my understanding is right? The class relevant is >> below, >> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala >> >> Chen >> >>
Re: Pivot Data in Spark and Scala
https://issues.apache.org/jira/browse/SPARK-8992 Should be in 1.6? -- Ruslan Dautkhanov On Thu, Oct 29, 2015 at 5:29 AM, Ascot Moss wrote: > Hi, > > I have data as follows: > > A, 2015, 4 > A, 2014, 12 > A, 2013, 1 > B, 2015, 24 > B, 2013 4 > > > I need to convert the data to a new format: > A ,4,12,1 > B, 24,,4 > > Any idea how to make it in Spark Scala? > > Thanks > >
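For reference, the DataFrame API added by SPARK-8992 in 1.6 covers exactly this case. A PySpark sketch against the sample data (the column names name, year, value are assumptions):

```python
df = sqlContext.createDataFrame(
    [("A", 2015, 4), ("A", 2014, 12), ("A", 2013, 1),
     ("B", 2015, 24), ("B", 2013, 4)],
    ["name", "year", "value"])

# one row per name, one column per distinct year
pivoted = df.groupBy("name").pivot("year").sum("value")
pivoted.show()
# expected: (A, 1, 12, 4) and (B, 4, null, 24), with columns 2013, 2014, 2015
```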
Re: save DF to JDBC
Thank you Richard and Matthew. DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned earlier, we're on CDH 5.4 / Spark 1.3. No options for this version? Best regards, Ruslan Dautkhanov On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas wrote: > Hi Ruslan, > > Here is some sample code which writes a DataFrame to a table in a Derby > database: > > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > > val binaryVal = Array[Byte] ( 1, 2, 3, 4 ) > val timestampVal = java.sql.Timestamp.valueOf("1996-01-01 03:30:36") > val dateVal = java.sql.Date.valueOf("1996-01-01") > > val allTypes = sc.parallelize( > Array( > (1, > 1.toLong, > 1.toDouble, > 1.toFloat, > 1.toShort, > 1.toByte, > "true".toBoolean, > "one ring to rule them all", > binaryVal, > timestampVal, > dateVal, > BigDecimal.valueOf(42549.12) > ) > )).toDF( > "int_col", > "long_col", > "double_col", > "float_col", > "short_col", > "byte_col", > "boolean_col", > "string_col", > "binary_col", > "timestamp_col", > "date_col", > "decimal_col" > ) > > val properties = new java.util.Properties() > > allTypes.write.jdbc("jdbc:derby:/Users/rhillegas/derby/databases/derby1", > "all_spark_types", properties) > > Hope this helps, > > Rick Hillegas > STSM, IBM Analytics, Platform - IBM USA > > > Ruslan Dautkhanov wrote on 10/05/2015 02:44:20 PM: > > > From: Ruslan Dautkhanov > > To: user > > Date: 10/05/2015 02:45 PM > > Subject: save DF to JDBC > > > > http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc- > > to-other-databases > > > > Spark JDBC can read data from JDBC, but can it save back to JDBC? > > Like to an Oracle database through its jdbc driver. > > > > Also looked at SQL Context documentation > > https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/ > > SQLContext.html > > and can't find anything relevant. > > > > Thanks! > > > > > > -- > > Ruslan Dautkhanov >
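For what it's worth, before DataFrameWriter the 1.3 DataFrame carried JDBC write methods itself (createJDBCTable / insertIntoJDBC, deprecated in 1.4 in favor of write.jdbc); whether they cover this use case is worth verifying against the 1.3 docs. A rough sketch with a placeholder Oracle URL (the Oracle JDBC driver jar has to be on the driver and executor classpath):

```python
url = "jdbc:oracle:thin:user/password@//dbhost:1521/orcl"  # placeholder

# create the target table and load it
df.createJDBCTable(url, "target_table", False)   # allowExisting=False

# or append into an already existing table
df.insertIntoJDBC(url, "target_table", False)    # overwrite=False
```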
save DF to JDBC
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases Spark JDBC can read data from JDBC, but can it save back to JDBC? Like to an Oracle database through its jdbc driver. Also looked at SQL Context documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SQLContext.html and can't find anything relevant. Thanks! -- Ruslan Dautkhanov
Re: Spark Web UI + NGINX
It should be really simple to set up. Check this Hue + NGINX setup page: http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ In that config file, change:

1) > server_name NGINX_HOSTNAME; to "Machine A, with a public IP"

2) > server HUE_HOST1: max_fails=3; > server HUE_HOST2: max_fails=3; to "Machine B, where Spark is installed"

3) You may want to adjust "location /static/" so that it fits your Spark Web UI.

4) With a few more config lines you can set up SSL offloading too.

-- Ruslan Dautkhanov
Re: Spark data type guesser UDAF
Does it deserve to be a JIRA in Spark / Spark MLlib? How do you guys normally determine data types? Frameworks like H2O automatically determine data types by scanning a sample of the data, or the whole dataset. One can then decide, e.g., whether a variable should be categorical or numerical. Another use case is when you get an arbitrary data set (we get them quite often) and want to save it as a Parquet table. Providing correct data types makes Parquet more space-efficient (and probably more query-time performant, e.g. better Parquet bloom filters than just storing everything as string/varchar). -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov wrote: > Wanted to take something like this > https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java > and create a Hive UDAF to create an aggregate function that returns a data > type guess. > Am I inventing a wheel? > Does Spark have something like this already built-in? > Would be very useful for new wide datasets to explore data. Would be > helpful for ML too, > e.g. to decide categorical vs numerical variables. > > > Ruslan > >
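As a starting point, even a crude sampling pass gets most of the way there for flat, all-string data; the regexes, type names, and sampling fraction below are all illustrative:

```python
import re

INT_RE = re.compile(r'^[+-]?\d+$')
FLOAT_RE = re.compile(r'^[+-]?\d*\.\d+([eE][+-]?\d+)?$')
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}$')

def guess_type(values):
    """Guess a column type from a list of string values."""
    values = [v for v in values if v not in (None, '')]
    if not values:
        return 'string'
    if all(INT_RE.match(v) for v in values):
        return 'bigint'
    if all(INT_RE.match(v) or FLOAT_RE.match(v) for v in values):
        return 'double'
    if all(DATE_RE.match(v) for v in values):
        return 'date'
    return 'string'

# sample ~1% of an all-string DataFrame and guess each column's type
sample_rows = df.sample(False, 0.01).collect()
guesses = {}
for i, c in enumerate(df.columns):
    guesses[c] = guess_type([row[i] for row in sample_rows])
```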
Spark data type guesser UDAF
I wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF for an aggregate function that returns a data type guess. Am I reinventing the wheel? Does Spark have something like this already built in? It would be very useful for exploring new, wide datasets. It would be helpful for ML too, e.g. to decide between categorical and numerical variables. Ruslan
Re: NGINX + Spark Web UI
Similar setup for Hue http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ Might give you an idea. -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 wrote: > Hello! > I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI. > I have 2 machines: > 1) Machine A, with a public IP. This machine will be used to access Spark > Web UI on the Machine B through its private IP address. > 2) Machine B, where Spark is installed (standalone master cluster, 1 worker > node and the history server) not accessible from the outside. > > Basically I want to access the Spark Web UI through my Machine A using the > URL: > http://machine_A_ip_address/spark > > Currently I have this setup: > http { > proxy_set_header X-Real-IP $remote_addr; > proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; > proxy_set_header Host $http_host; > proxy_set_header X-NginX-Proxy true; > proxy_set_header X-Ssl on; > } > > # Master cluster node > upstream app_master { > server machine_B_ip_address:8080; > } > > # Slave worker node > upstream app_worker { > server machine_B_ip_address:8081; > } > > # Job UI > upstream app_ui { > server machine_B_ip_address:4040; > } > > # History server > upstream app_history { > server machine_B_ip_address:18080; > } > > I'm really struggling in figuring out a correct location directive to make > the whole thing work, not only for accessing all ports using the url /spark > but also in making the links in the web app be transformed accordingly. Any > idea? > > > Any help really appreciated. > Thank you in advance. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/NGINX-Spark-Web-UI-tp24726.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: Spark ANN
Thank you Alexander. Sounds like quite a lot of good and exciting changes slated for Spark's ANN. Looking forward to it. -- Ruslan Dautkhanov On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander wrote: > Thank you, Feynman, this is helpful. The paper that I linked claims a big > speedup with FFTs for large convolution size. Though as you mentioned there > is no FFT transformer in Spark yet. Moreover, we would need a parallel > version of FFTs to support batch computations. So it probably worth > considering matrix-matrix multiplication for convolution optimization at > least as a first version. It can also take advantage of data batches. > > > > *From:* Feynman Liang [mailto:fli...@databricks.com] > *Sent:* Wednesday, September 09, 2015 12:56 AM > > *To:* Ulanov, Alexander > *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > My 2 cents: > > > > * There is frequency domain processing available already (e.g. spark.ml > DCT transformer) but no FFT transformer yet because complex numbers are not > currently a Spark SQL datatype > > * We shouldn't assume signals are even, so we need complex numbers to > implement the FFT > > * I have not closely studied the relative performance tradeoffs, so please > do let me know if there's a significant difference in practice > > > > > > > > On Tue, Sep 8, 2015 at 5:46 PM, Ulanov, Alexander < > alexander.ula...@hpe.com> wrote: > > That is an option too. Implementing convolutions with FFTs should be > considered as well http://arxiv.org/pdf/1312.5851.pdf. > > > > *From:* Feynman Liang [mailto:fli...@databricks.com] > *Sent:* Tuesday, September 08, 2015 12:07 PM > *To:* Ulanov, Alexander > *Cc:* Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > Just wondering, why do we need tensors? Is the implementation of convnets > using im2col (see here <http://cs231n.github.io/convolutional-networks/>) > insufficient? > > > > On Tue, Sep 8, 2015 at 11:55 AM, Ulanov, Alexander < > alexander.ula...@hpe.com> wrote: > > Ruslan, thanks for including me in the discussion! > > > > Dropout and other features such as Autoencoder were implemented, but not > merged yet in order to have room for improving the internal Layer API. For > example, there is an ongoing work with convolutional layer that > consumes/outputs 2D arrays. We’ll probably need to change the Layer’s > input/output type to tensors. This will influence dropout which will need > some refactoring to handle tensors too. Also, all new components should > have ML pipeline public interface. There is an umbrella issue for deep > learning in Spark https://issues.apache.org/jira/browse/SPARK-5575 which > includes various features of Autoencoder, in particular > https://issues.apache.org/jira/browse/SPARK-10408. You are very welcome > to join and contribute since there is a lot of work to be done. > > > > Best regards, Alexander > > *From:* Ruslan Dautkhanov [mailto:dautkha...@gmail.com] > *Sent:* Monday, September 07, 2015 10:09 PM > *To:* Feynman Liang > *Cc:* Nick Pentreath; user; na...@yandex.ru > *Subject:* Re: Spark ANN > > > > Found a dropout commit from avulanov: > > > https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d > > > > It probably hasn't made its way to MLLib (yet?). > > > > > > -- > Ruslan Dautkhanov > > > > On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang > wrote: > > Unfortunately, not yet... 
Deep learning support (autoencoders, RBMs) is on > the roadmap for 1.6 <https://issues.apache.org/jira/browse/SPARK-10324> > though, and there is a spark package > <http://spark-packages.org/package/rakeshchalasani/MLlib-dropout> for > dropout regularized logistic regression. > > > > > > On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov > wrote: > > Thanks! > > > > It does not look Spark ANN yet supports dropout/dropconnect or any other > techniques that help avoiding overfitting? > > http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf > > https://cs.nyu.edu/~wanli/dropc/dropc.pdf > > > > ps. There is a small copy-paste typo in > > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 > > should read B&C :) > > > > > > -- > Ruslan Dautkhanov > > > > On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang > wrote: > > Backprop is used to compute the gradient here > <https://github.com/apache/spark/blob/master/mllib/src/main/scala
Re: Best way to import data from Oracle to Spark?
Sathish, Thanks for pointing to that. https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm That must be only part of Oracle's BDA codebase, not open-source Hive, right? -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 6:59 AM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > I guess data pump export from Oracle could be fast option. Hive now has > oracle data pump serde.. > > https://docs.oracle.com/cd/E57371_01/doc.41/e57351/copy2bda.htm > > > > On Wed, Sep 9, 2015 at 4:41 AM Reynold Xin wrote: > >> Using the JDBC data source is probably the best way. >> http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases >> >> On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: >> >>> What's the best way to import data from Oracle to Spark? Thanks! >>> >>> >>> -- >>> Best regards! >>> >>> Lin,Cui >>> >> >>
Re: Avoiding SQL Injection in Spark SQL
Using the DataFrame API is a good workaround. Another way would be to use bind variables. I don't think Spark SQL supports them. That's what Dinesh probably meant by "was not able to find any API for preparing the SQL statement safely avoiding injection". E.g. (hypothetical syntax) val sql_handler = sqlContext.sql("SELECT name FROM people WHERE age >= :var1 AND age <= :var2").parse() toddlers = sql_handler.execute("var1"->1, "var2"->3) teenagers = sql_handler.execute(13, 19) SQL injection would not be possible if Spark SQL supported bind variables, as parameters would always be treated as values and never as part of the SQL text. It's also arguably easier for developers as you don't have to escape/quote. ps. Another advantage is that Spark could parse and create the plan once, but execute it multiple times. http://www.akadia.com/services/ora_bind_variables.html This point is more relevant for OLTP-like queries, which Spark is probably not yet good at (e.g. returning a few rows quickly, within a few ms). -- Ruslan Dautkhanov On Thu, Sep 10, 2015 at 12:07 PM, Michael Armbrust wrote: > Either that or use the DataFrame API, which directly constructs query > plans and thus doesn't suffer from injection attacks (and runs on the same > execution engine). > > On Thu, Sep 10, 2015 at 12:10 AM, Sean Owen wrote: > >> I don't think this is Spark-specific. Mostly you need to escape / >> quote user-supplied values as with any SQL engine. >> >> On Thu, Sep 10, 2015 at 7:32 AM, V Dineshkumar >> wrote: >> > Hi, >> > >> > What is the preferred way of avoiding SQL Injection while using Spark >> SQL? >> > In our use case we have to take the parameters directly from the users >> and >> > prepare the SQL Statement.I was not able to find any API for preparing >> the >> > SQL statement safely avoiding injection. >> > >> > Thanks, >> > Dinesh >> > Philips India >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
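For completeness, a minimal PySpark sketch of the DataFrame-API workaround Michael mentions: the user-supplied values are passed in as typed arguments and never concatenated into a SQL string, so there is nothing to inject. The people table and the user_min/user_max variables are assumptions for the example.

from pyspark.sql import functions as F

user_min, user_max = 1, 3   # values coming from the user
toddlers = (people
    .filter((F.col('age') >= user_min) & (F.col('age') <= user_max))
    .select('name'))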
Re: Best way to import data from Oracle to Spark?
You can also sqoop oracle data in $ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/ -- Ruslan Dautkhanov On Tue, Sep 8, 2015 at 11:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best regards! > > Lin,Cui >
Re: Spark ANN
Found a dropout commit from avulanov: https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d It probably hasn't made its way to MLLib (yet?). -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang wrote: > Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on > the roadmap for 1.6 <https://issues.apache.org/jira/browse/SPARK-10324> > though, and there is a spark package > <http://spark-packages.org/package/rakeshchalasani/MLlib-dropout> for > dropout regularized logistic regression. > > > On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov > wrote: > >> Thanks! >> >> It does not look Spark ANN yet supports dropout/dropconnect or any other >> techniques that help avoiding overfitting? >> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf >> https://cs.nyu.edu/~wanli/dropc/dropc.pdf >> >> ps. There is a small copy-paste typo in >> >> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 >> should read B&C :) >> >> >> >> -- >> Ruslan Dautkhanov >> >> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang >> wrote: >> >>> Backprop is used to compute the gradient here >>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L579-L584>, >>> which is then optimized by SGD or LBFGS here >>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L878> >>> >>> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath < >>> nick.pentre...@gmail.com> wrote: >>> >>>> Haven't checked the actual code but that doc says "MLPC employes >>>> backpropagation for learning the model. .."? >>>> >>>> >>>> >>>> — >>>> Sent from Mailbox <https://www.dropbox.com/mailbox> >>>> >>>> >>>> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov >>> > wrote: >>>> >>>>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html >>>>> >>>>> Implementation seems missing backpropagation? >>>>> Was there is a good reason to omit BP? >>>>> What are the drawbacks of a pure feedforward-only ANN? >>>>> >>>>> Thanks! >>>>> >>>>> >>>>> -- >>>>> Ruslan Dautkhanov >>>>> >>>> >>>> >>> >> >
Re: Spark ANN
Thanks! It does not look Spark ANN yet supports dropout/dropconnect or any other techniques that help avoiding overfitting? http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf https://cs.nyu.edu/~wanli/dropc/dropc.pdf ps. There is a small copy-paste typo in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43 should read B&C :) -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang wrote: > Backprop is used to compute the gradient here > <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L579-L584>, > which is then optimized by SGD or LBFGS here > <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L878> > > On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath > wrote: > >> Haven't checked the actual code but that doc says "MLPC employes >> backpropagation for learning the model. .."? >> >> >> >> — >> Sent from Mailbox <https://www.dropbox.com/mailbox> >> >> >> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov >> wrote: >> >>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html >>> >>> Implementation seems missing backpropagation? >>> Was there is a good reason to omit BP? >>> What are the drawbacks of a pure feedforward-only ANN? >>> >>> Thanks! >>> >>> >>> -- >>> Ruslan Dautkhanov >>> >> >> >
Re: Parquet Array Support Broken?
Read response from Cheng Lian on Aug/27th - it looks the same problem. Workarounds 1. write that parquet file in Spark; 2. upgrade to Spark 1.5. -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote: > No, it was created in Hive by CTAS, but any help is appreciated... > > On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov > wrote: > >> That parquet table wasn't created in Spark, is it? >> >> There was a recent discussion on this list that complex data types in >> Spark prior to 1.5 often incompatible with Hive for example, if I remember >> correctly. >> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: >> >>> I am trying to read an (array typed) parquet file in spark-shell (Spark >>> 1.4.1 with Hadoop 2.6): >>> >>> {code} >>> $ bin/spark-shell >>> log4j:WARN No appenders could be found for logger >>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). >>> log4j:WARN Please initialize the log4j system properly. >>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig >>> for more info. >>> Using Spark's default log4j profile: >>> org/apache/spark/log4j-defaults.properties >>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata >>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata >>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication >>> disabled; ui acls disabled; users with view permissions: Set(hivedata); >>> users with modify permissions: Set(hivedata) >>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server >>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class >>> server' on port 43731. >>> Welcome to >>> __ >>> / __/__ ___ _/ /__ >>> _\ \/ _ \/ _ `/ __/ '_/ >>>/___/ .__/\_,_/_/ /_/\_\ version 1.4.1 >>> /_/ >>> >>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java >>> 1.8.0) >>> Type in expressions to have them evaluated. >>> Type :help for more information. >>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1 >>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata >>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata >>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication >>> disabled; ui acls disabled; users with view permissions: Set(hivedata); >>> users with modify permissions: Set(hivedata) >>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started >>> 15/09/07 13:45:27 INFO Remoting: Starting remoting >>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on >>> addresses :[akka.tcp://sparkDriver@10.10.30.52:46083] >>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' >>> on port 46083. >>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker >>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster >>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at >>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0 >>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity >>> 265.1 MB >>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is >>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21 >>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server >>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file >>> server' on port 38717. 
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator >>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port >>> 4040. Attempting port 4041. >>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on >>> port 4041. >>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at >>> http://10.10.30.52:4041 >>> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host >>> localhost >>> 15/09/07 13:45:27 INFO Executor: Using REPL class URI: >>> http://10.10.30.52:43731 >>> 15/09/07 13:45:27 INFO Utils: Successfully started service >>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973. >>> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973 >>> 15/0
Re: Parquet Array Support Broken?
That parquet table wasn't created in Spark, is it? There was a recent discussion on this list that complex data types in Spark prior to 1.5 often incompatible with Hive for example, if I remember correctly. On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: > I am trying to read an (array typed) parquet file in spark-shell (Spark > 1.4.1 with Hadoop 2.6): > > {code} > $ bin/spark-shell > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for > more info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata > 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata > 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hivedata); > users with modify permissions: Set(hivedata) > 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server > 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class > server' on port 43731. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.4.1 > /_/ > > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0) > Type in expressions to have them evaluated. > Type :help for more information. > 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1 > 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata > 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata > 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hivedata); > users with modify permissions: Set(hivedata) > 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started > 15/09/07 13:45:27 INFO Remoting: Starting remoting > 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriver@10.10.30.52:46083] > 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' > on port 46083. > 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker > 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster > 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at > /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0 > 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity > 265.1 MB > 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21 > 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server > 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file > server' on port 38717. > 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator > 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port > 4040. Attempting port 4041. > 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on > port 4041. 
> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041 > 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host > localhost > 15/09/07 13:45:27 INFO Executor: Using REPL class URI: > http://10.10.30.52:43731 > 15/09/07 13:45:27 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973. > 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973 > 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager > 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block > manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver, > localhost, 60973) > 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager > 15/09/07 13:45:28 INFO SparkILoop: Created spark context.. > Spark context available as sc. > 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version > 0.13.1 > 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called > 15/09/07 13:45:29 INFO Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2 > unknown - will be ignored > 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in > CLASSPATH (or one of dependencies) > 15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes > with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,
Spark ANN
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html The implementation seems to be missing backpropagation? Was there a good reason to omit BP? What are the drawbacks of a pure feedforward-only ANN? Thanks! -- Ruslan Dautkhanov
Re: Ranger-like Security on Spark
You could define access in Sentry and enable permissions sync with HDFS, so you could just grant access in Hive on a per-database or per-table basis. It should work for Spark too, as Sentry will propagate the grants to HDFS ACLs. http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/sg_hdfs_sentry_sync.html -- Ruslan Dautkhanov On Thu, Sep 3, 2015 at 1:46 PM, Daniel Schulz wrote: > Hi Matei, > > Thanks for your answer. > > My question is regarding simple authenticated Spark-on-YARN only, without > Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my > HDFS user and only be able to access files I am entitled to read/write? > Will it enforce HDFS ACLs and Ranger policies as well? > > Best regards, Daniel. > > > On 03 Sep 2015, at 21:16, Matei Zaharia wrote: > > > > If you run on YARN, you can use Kerberos, be authenticated as the right > user, etc in the same way as MapReduce jobs. > > > > Matei > > > >> On Sep 3, 2015, at 1:37 PM, Daniel Schulz > wrote: > >> > >> Hi, > >> > >> I really enjoy using Spark. An obstacle to sell it to our clients > currently is the missing Kerberos-like security on a Hadoop with simple > authentication. Are there plans, a proposal, or a project to deliver a > Ranger plugin or something similar to Spark. The target is to differentiate > users and their privileges when reading and writing data to HDFS? Is > Kerberos my only option then? > >> > >> Kind regards, Daniel. > >> - > >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > >> For additional commands, e-mail: user-h...@spark.apache.org > > > > > > - > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: FAILED_TO_UNCOMPRESS error from Snappy
https://issues.apache.org/jira/browse/SPARK-7660 ? -- Ruslan Dautkhanov On Thu, Aug 20, 2015 at 1:49 PM, Kohki Nishio wrote: > Right after upgraded to 1.4.1, we started seeing this exception and yes we > picked up snappy-java-1.1.1.7 (previously snappy-java-1.1.1.6). Is there > anything I could try ? I don't have a repro case. > > org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:127) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:169) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:168) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:168) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > at > 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92) > at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1178) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300) > at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Success.map(Try.scala:206) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) > ... 33 more > > > -- > Kohki Nishio >
Re: Spark Master HA on YARN
There is no Spark master in YARN mode. It's standalone mode terminology. In YARN cluster mode, Spark's Application Master (Spark Driver runs in it) will be restarted automatically by RM up to yarn.resourcemanager.am.max-retries times (default is 2). -- Ruslan Dautkhanov On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta wrote: > Hi, > > Is Spark master high availability supported on YARN (yarn-client mode) > analogous to > https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability > ? > > Thanks > Bhaskie >
Re: Spark job workflow engine recommendations
We use Talend, but not for Spark workflows. Although it does have Spark componenets. https://www.talend.com/download/talend-open-studio It is free (commercial support available), easy to design and deploy workflows. Talend for BigData 6.0 was released as month ago. Is anybody using Talend for Spark? -- Ruslan Dautkhanov On Tue, Aug 11, 2015 at 11:30 AM, Hien Luu wrote: > We are in the middle of figuring that out. At the high level, we want to > combine the best parts of existing workflow solutions. > > On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone wrote: > >> Hien, >> Is Azkaban being phased out at linkedin as rumored? If so, what's >> linkedin going to use for workflow scheduling? Is there something else >> that's going to replace Azkaban? >> >> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu wrote: >> >>> In my opinion, choosing some particular project among its peers should >>> leave enough room for future growth (which may come faster than you >>> initially think). >>> >>> Cheers >>> >>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu wrote: >>> >>>> Scalability is a known issue due the the current architecture. However >>>> this will be applicable if you run more 20K jobs per day. >>>> >>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu wrote: >>>> >>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban is >>>>> being phased out at LinkedIn because of scalability issues (though >>>>> UI-wise, >>>>> Azkaban seems better). >>>>> >>>>> Vikram: >>>>> I suggest you do more research in related projects (maybe using their >>>>> mailing lists). >>>>> >>>>> Disclaimer: I don't work for LinkedIn. >>>>> >>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath < >>>>> nick.pentre...@gmail.com> wrote: >>>>> >>>>>> Hi Vikram, >>>>>> >>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just >>>>>> use local mode deployment and it is fairly easy to set up. It is pretty >>>>>> easy to use and has a nice scheduling and logging interface, as well as >>>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or >>>>>> whatever). >>>>>> >>>>>> However Spark support is not present directly - we run everything >>>>>> with shell scripts and spark-submit. There is a plugin interface where >>>>>> one >>>>>> could create a Spark plugin, but I found it very cumbersome when I did >>>>>> investigate and didn't have the time to work through it to develop that. >>>>>> >>>>>> It has some quirks and while there is actually a REST API for adding >>>>>> jos and dynamically scheduling jobs, it is not documented anywhere so you >>>>>> kinda have to figure it out for yourself. But in terms of ease of use I >>>>>> found it way better than Oozie. I haven't tried Chronos, and it seemed >>>>>> quite involved to set up. Haven't tried Luigi either. >>>>>> >>>>>> Spark job server is good but as you say lacks some stuff like >>>>>> scheduling and DAG type workflows (independent of spark-defined job >>>>>> flows). >>>>>> >>>>>> >>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke >>>>>> wrote: >>>>>> >>>>>>> Check also falcon in combination with oozie >>>>>>> >>>>>>> Le ven. 7 août 2015 à 17:51, Hien Luu a >>>>>>> écrit : >>>>>>> >>>>>>>> Looks like Oozie can satisfy most of your requirements. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> I'm looking for open source workflow tools/engines that allow us >>>>>>>>> to schedule spark jobs on a datastax cassandra cluster. 
Since there >>>>>>>>> are >>>>>>>>> tonnes of alternatives out there like Ozzie, Azkaban, Luigi , Chronos >>>>>>>>> etc, >>>>>>>>> I wanted to check with people here to see what they are using today. >>>>>>>>> >>>>>>>>> Some of the requirements of the workflow engine that I'm looking >>>>>>>>> for are >>>>>>>>> >>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not >>>>>>>>> some wrapper Java code to submit tasks. >>>>>>>>> 2. Active open source community support and well tested at >>>>>>>>> production scale. >>>>>>>>> 3. Should be dead easy to write job dependencices using XML or web >>>>>>>>> interface . Ex; job A depends on Job B and Job C, so run Job A after >>>>>>>>> B and >>>>>>>>> C are finished. Don't need to write full blown java applications to >>>>>>>>> specify >>>>>>>>> job parameters and dependencies. Should be very simple to use. >>>>>>>>> 4. Time based recurrent scheduling. Run the spark jobs at a given >>>>>>>>> time every hour or day or week or month. >>>>>>>>> 5. Job monitoring, alerting on failures and email notifications on >>>>>>>>> daily basis. >>>>>>>>> >>>>>>>>> I have looked at Ooyala's spark job server which seems to be hated >>>>>>>>> towards making spark jobs run faster by sharing contexts between the >>>>>>>>> jobs >>>>>>>>> but isn't a full blown workflow engine per se. A combination of spark >>>>>>>>> job >>>>>>>>> server and workflow engine would be ideal >>>>>>>>> >>>>>>>>> Thanks for the inputs >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> >
Re: collect() works, take() returns "ImportError: No module named iter"
There is was a similar problem reported before on this list. Weird python errors like this generally mean you have different versions of python in the nodes of your cluster. Can you check that? >From error stack you use 2.7.10 |Anaconda 2.3.0 while OS/CDH version of Python is probably 2.6. -- Ruslan Dautkhanov On Mon, Aug 10, 2015 at 3:53 PM, YaoPau wrote: > I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via > iPython Notebook. I'm getting collect() to work just fine, but take() > errors. (I'm having issues with collect() on other datasets ... but take() > seems to break every time I run it.) > > My code is below. Any thoughts? > > >> sc > > >> sys.version > '2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) \n[GCC > 4.4.7 20120313 (Red Hat 4.4.7-1)]' > >> hourly = sc.textFile('tester') > >> hourly.collect() > [u'a man', > u'a plan', > u'a canal', > u'panama'] > >> hourly = sc.textFile('tester') > >> hourly.take(2) > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 hourly = sc.textFile('tester') > > 2 hourly.take(2) > > /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py in take(self, > num) >1223 >1224 p = range(partsScanned, min(partsScanned + > numPartsToTry, totalParts)) > -> 1225 res = self.context.runJob(self, takeUpToNumLeft, p, > True) >1226 >1227 items += res > > /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in > runJob(self, rdd, partitionFunc, partitions, allowLocal) > 841 # SparkContext#runJob. > 842 mappedRDD = rdd.mapPartitions(partitionFunc) > --> 843 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), > mappedRDD._jrdd, javaPartitions, allowLocal) > 844 return list(mappedRDD._collect_iterator_through_file(it)) > 845 > > > /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, > self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > > > /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 10.0 (TID 47, dhd490101.autotrader.com): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > > "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py", > line 101, in main > process() > File > > "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/worker.py", > line 96, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > > "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p894.568/jars/spark-assembly-1.3.0-cdh5.4.4-hadoop2.6.0-cdh5.4.4.jar/pyspark/serializers.py", > line 236, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/rdd.py", line > 1220, in takeUpToNumLeft > while taken < left: > ImportError: No module named iter > > at > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135) > at > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:176) > at > org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(
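Following up on the version-mismatch suggestion above, a quick way to check which Python the driver and the executors actually run (a sketch that assumes a live SparkContext sc; the element and partition counts are arbitrary):

import sys
print(sys.version)                      # driver's Python
executor_versions = (sc.parallelize(range(100), 10)
    .map(lambda _: sys.version)
    .distinct()
    .collect())
print(executor_versions)                # should match the driver's version

More than one entry, or a 2.6 entry when the driver shows Anaconda 2.7, points to a mismatch; setting PYSPARK_PYTHON to the same interpreter on all nodes usually fixes it.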
Re: TCP/IP speedup
If your network is bandwidth-bound, you'll see setting jumbo frames (MTU 9000) may increase bandwidth up to ~20%. http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm "Enabling Jumbo Frames across the cluster improves bandwidth" If Spark workload is not network bandwidth-bound, I can see it'll be a few percent to no improvement. -- Ruslan Dautkhanov On Sat, Aug 1, 2015 at 6:08 PM, Simon Edelhaus wrote: > H > > 2% huh. > > > -- ttfn > Simon Edelhaus > California 2015 > > On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra > wrote: > >> https://spark-summit.org/2015/events/making-sense-of-spark-performance/ >> >> On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: >> >>> Hi All! >>> >>> How important would be a significant performance improvement to TCP/IP >>> itself, in terms of >>> overall job performance improvement. Which part would be most >>> significantly accelerated? >>> Would it be HDFS? >>> >>> -- ttfn >>> Simon Edelhaus >>> California 2015 >>> >> >> >
Re: Spark Number of Partitions Recommendations
You should also take into account the amount of memory that you plan to use. It's advised not to give too much memory to each executor, otherwise GC overhead will go up. Btw, why prime numbers? -- Ruslan Dautkhanov On Wed, Jul 29, 2015 at 3:31 AM, ponkin wrote: > Hi Rahul, > > Where did you see such a recommendation? > I personally define partitions with the following formula > > partitions = nextPrimeNumberAbove( K*(--num-executors * --executor-cores ) > ) > > where > nextPrimeNumberAbove(x) - prime number which is greater than x > K - multiplicator to calculate start with 1 and encrease untill join > perfomance start to degrade > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
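Purely for illustration, the heuristic quoted above rendered as Python (the prime-number choice is the poster's own convention, not something Spark requires; the executor figures below are made up):

def next_prime_above(x):
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    n = x + 1
    while not is_prime(n):
        n += 1
    return n

num_executors, executor_cores, k = 10, 4, 2
partitions = next_prime_above(k * num_executors * executor_cores)   # -> 83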
Re: Is SPARK is the right choice for traditional OLAP query processing?
>> We want these use actions respond within 2 to 5 seconds. I think this goal is a stretch for Spark. Some queries may run faster than that on a large dataset, but in general you can't put an SLA like this on it. For example, if you have to join some huge datasets, you'll likely be well over that. Spark is great for huge jobs and it'll be much faster than MR, but I don't think Spark was designed with interactive queries in mind. For example, although Spark is "in-memory", its in-memory caching is only for the duration of a job. It's not like traditional RDBMS systems where you have a persistent "buffer cache" or "in-memory columnar storage" (both are Oracle terms). If you have multiple users running interactive BI queries, results that were cached for the first user wouldn't be used by the second user, unless you invent something that keeps a persistent Spark context, serves users' requests, and decides which RDDs to cache, when and how. At least that's my understanding of how Spark works. If I'm wrong, I will be glad to hear that, as we ran into the same questions. As we use Cloudera's CDH, I'm not sure where Hortonworks are with their Tez project, but Tez has components that more closely resemble the "buffer cache" or "in-memory columnar storage" caching of traditional RDBMS systems, and it may get better and/or more predictable performance on BI queries. -- Ruslan Dautkhanov On Mon, Jul 20, 2015 at 6:04 PM, renga.kannan wrote: > All, > I really appreciate anyone's input on this. We are having a very simple > traditional OLAP query processing use case. Our use case is as follows. > > > 1. We have a customer sales order table data coming from RDBMs table. > 2. There are many dimension columns in the sales order table. For each of > those dimensions, we have individual dimension tables that stores the > dimension record sets. > 3. We also have some BI like hierarchies that is defined for dimension data > set. > > What we want for business users is as follows.? > > 1. We wanted to show some aggregated values from sales Order transaction > table columns. > 2. User would like to filter these with specific dimension values from > dimension table. > 3. User should be able to drill down from higher level to lower level by > traversing hierarchy on dimension > > > We want these use actions respond within 2 to 5 seconds. > > > We are thinking about using SPARK as our backend enginee to sever data to > these front end application. > > > Has anyone tried using SPARK for these kind of use cases. These are all > traditional use cases in BI space. If so, can SPARK respond to these > queries > with in 2 to 5 seconds for large data sets. > > Thanks, > Renga > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-SPARK-is-the-right-choice-for-traditional-OLAP-query-processing-tp23921.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: Is IndexedRDD available in Spark 1.4.0?
Or Spark on HBase ) http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ -- Ruslan Dautkhanov On Tue, Jul 14, 2015 at 7:07 PM, Ted Yu wrote: > bq. that is, key-value stores > > Please consider HBase for this purpose :-) > > On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das > wrote: > >> I do not recommend using IndexRDD for state management in Spark >> Streaming. What it does not solve out-of-the-box is checkpointing of >> indexRDDs, which important because long running streaming jobs can lead to >> infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey >> operation which you can use, which gives state management capabilities. >> Though for most flexible arbitrary look up of stuff, its better to use a >> dedicated system that is designed and optimized for long term storage of >> data, that is, key-value stores, databases, etc. >> >> On Tue, Jul 14, 2015 at 5:44 PM, Ted Yu wrote: >> >>> Please take a look at SPARK-2365 which is in progress. >>> >>> On Tue, Jul 14, 2015 at 5:18 PM, swetha >>> wrote: >>> >>>> Hi, >>>> >>>> Is IndexedRDD available in Spark 1.4.0? We would like to use this in >>>> Spark >>>> Streaming to do lookups/updates/deletes in RDDs using keys by storing >>>> them >>>> as key/value pairs. >>>> >>>> Thanks, >>>> Swetha >>>> >>>> >>>> >>>> -- >>>> View this message in context: >>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD-available-in-Spark-1-4-0-tp23841.html >>>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>>> >>>> - >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>> For additional commands, e-mail: user-h...@spark.apache.org >>>> >>>> >>> >> >
Re: RECEIVED SIGNAL 15: SIGTERM
>> the executor receives a SIGTERM (from whom???) >From YARN Resource Manager. Check if yarn fair scheduler preemption and/or speculative execution are turned on, then it's quite possible and not a bug. -- Ruslan Dautkhanov On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim wrote: > Based on my experience, YARN containers can get SIGTERM when > > - it produces too much logs and use up the hard drive > - it uses off-heap memory more than what is given by > spark.yarn.executor.memoryOverhead configuration. It might be due to too > many classes loaded (less than MaxPermGen but more than memoryOverhead), or > some other off-heap memory allocated by networking library, etc. > - it opens too many file descriptors, which you can check on the executor > node's /proc//fd/ > > Does any of these apply to your situation? > > Jong Wook > > On Jul 7, 2015, at 19:16, Kostas Kougios > wrote: > > I am still receiving these weird sigterms on the executors. The driver > claims > it lost the executor, the executor receives a SIGTERM (from whom???) > > It doesn't seem a memory related issue though increasing memory takes the > job a bit further or completes it. But why? there is no memory pressure on > neither driver nor executor. And nothing in the logs indicating so. > > driver: > > 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in > stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes) > 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in > stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240) > 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated > or disconnected! Shutting down. cruncher05.stratified:32976 > 15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on > cruncher05.stratified: remote Rpc client disassociated > 15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 > from TaskSet 0.0 > 15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated > or disconnected! Shutting down. cruncher05.stratified:32976 > 15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with > remote system [akka.tcp://sparkExecutor@cruncher05.stratified:32976] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
> > 15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage > 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1 > lost) > > gc log for driver, it doesnt look like it run outofmem: > > 2015-07-07T10:45:19.887+0100: [GC (Allocation Failure) > 1764131K->1391211K(3393024K), 0.0102839 secs] > 2015-07-07T10:46:00.934+0100: [GC (Allocation Failure) > 1764971K->1391867K(3405312K), 0.0099062 secs] > 2015-07-07T10:46:45.252+0100: [GC (Allocation Failure) > 1782011K->1392596K(3401216K), 0.0167572 secs] > > executor: > > 15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0 > (TID 14750) > 15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not > found, computing it > 15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL 15: SIGTERM > 15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called > > executor gc log (no outofmem as it seems): > 2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC) > 24696750K->23712939K(33523712K), 0.0416640 secs] > 2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC) > 24700520K->23722043K(33523712K), 0.0391156 secs] > 2015-07-07T10:47:02.862+0100: [GC (Allocation Failure) > 24709182K->23726510K(33518592K), 0.0390784 secs] > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > >
Re: Spark equivalent for Oracle's analytical functions
Should be part of Spark 1.4 https://issues.apache.org/jira/browse/SPARK-1442 I don't see it in the documentation though https://spark.apache.org/docs/latest/sql-programming-guide.html -- Ruslan Dautkhanov On Mon, Jul 6, 2015 at 5:06 AM, gireeshp wrote: > Is there any equivalent of Oracle's *analytical functions* in Spark SQL. > > For example, if I have following data set (say table T): > /EID|DEPT > 101|COMP > 102|COMP > 103|COMP > 104|MARK/ > > In Oracle, I can do something like > /select EID, DEPT, count(1) over (partition by DEPT) CNT from T;/ > > to get: > /EID|DEPT|CNT > 101|COMP|3 > 102|COMP|3 > 103|COMP|3 > 104|MARK|1/ > > Can we do an equivalent query in Spark-SQL? Or what is the best method to > get such results in Spark dataframes? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-equivalent-for-Oracle-s-analytical-functions-tp23646.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
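For reference, a short PySpark sketch of the window-function API that landed with SPARK-1442; in 1.4 this needs a HiveContext, if I remember correctly. The tiny dataset mirrors the example from the question; everything else (a live SparkContext sc, the temp table name T) is assumed.

from pyspark.sql import HiveContext, Window
from pyspark.sql import functions as F

sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame(
    [(101, 'COMP'), (102, 'COMP'), (103, 'COMP'), (104, 'MARK')],
    ['EID', 'DEPT'])

w = Window.partitionBy('DEPT')
df.select('EID', 'DEPT', F.count('EID').over(w).alias('CNT')).show()

# or the same thing through SQL
df.registerTempTable('T')
sqlContext.sql('SELECT EID, DEPT, count(1) OVER (PARTITION BY DEPT) AS CNT FROM T').show()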
Re: Does spark supports the Hive function posexplode function?
You can see what Spark SQL functions are supported in Spark by doing the following in a notebook: %sql show functions https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html I think Spark SQL support is currently around Hive ~0.11? -- Ruslan Dautkhanov On Tue, Jul 7, 2015 at 3:10 PM, Jeff J Li wrote: > I am trying to use the posexplode function in the HiveContext to > auto-generate a sequence number. This feature is supposed to be available > Hive 0.13.0. > > SELECT name, phone FROM contact LATERAL VIEW > posexplode(phoneList.phoneNumber) phoneTable AS pos, phone > > My test program failed with the following > > java.lang.ClassNotFoundException: posexplode > at java.net.URLClassLoader.findClass(URLClassLoader.java:665) > at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:942) > at java.lang.ClassLoader.loadClass(ClassLoader.java:851) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:827) > at > org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:147) > at > org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274) > at > org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274) > > Does spark support this Hive function posexplode? If not, how to patch it > to support this? I am on Spark 1.3.1 > > Thanks, > Jeff Li > > >
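If posexplode turns out not to be available in the Spark version at hand, one hedged workaround is to generate the position index on the RDD side with enumerate(). The contact DataFrame and the phoneList.phoneNumber array column are taken from the question; the rest (sqlContext, the output column names) is assumed for the sketch.

rows = contact.select('name', 'phoneList.phoneNumber').rdd
numbered = rows.flatMap(
    lambda r: [(r[0], pos, phone) for pos, phone in enumerate(r[1] or [])])
numbered_df = sqlContext.createDataFrame(numbered, ['name', 'pos', 'phone'])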
Re: How to upgrade Spark version in CDH 5.4
Good question. I'd like to know the same. Although I think you'll loose supportability. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 2:03 AM, Ashish Dutt wrote: > > Hi, > I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4. > I checked the documentation here > <http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_upgrade_cdh5_using_parcels.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7dc6__section_zd5_1yz_l4> > but > I do not see any thing relevant > > Any suggestions directing to a solution are welcome. > > Thanks, > Ashish >
Re: Real-time data visualization with Zeppelin
Don't think it is a Zeppelin problem.. RDDs are "immutable". Unless you integrate something like IndexedRDD http://spark-packages.org/package/amplab/spark-indexedrdd into Zeppelin I think it's not possible. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 3:24 PM, Brandon White wrote: > Can you use a con job to update it every X minutes? > > On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya > wrote: > >> Hi all – I’m just wondering if anyone has had success integrating Spark >> Streaming with Zeppelin and actually dynamically updating the data in near >> real-time. From my investigation, it seems that Zeppelin will only allow >> you to display a snapshot of data, not a continuously updating table. Has >> anyone figured out if there’s a way to loop a display command or how to >> provide a mechanism to continuously update visualizations? >> >> Thank you, >> Ilya Ganelin >> >> >> -- >> >> The information contained in this e-mail is confidential and/or >> proprietary to Capital One and/or its affiliates and may only be used >> solely in performance of work or services for Capital One. The information >> transmitted herewith is intended only for use by the individual or entity >> to which it is addressed. If the reader of this message is not the intended >> recipient, you are hereby notified that any review, retransmission, >> dissemination, distribution, copying or other use of, or taking of any >> action in reliance upon this information is strictly prohibited. If you >> have received this communication in error, please contact the sender and >> delete the material from your computer. >> > >
Re: Caching in spark
Hi Akhil, I'm curious whether RDDs are stored internally in a columnar format as well, or whether an RDD is converted to the columnar format only when it is cached in the SQL context. What about DataFrames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das wrote: > > https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory > > Thanks > Best Regards > > On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar > wrote: > >> Hi Guys, >> >> Can any one please share me how to use caching feature of spark via spark >> sql queries? >> >> -Vinod >> > >
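To make the linked "Caching Data In Memory" section concrete, the caching hooks exposed through the SQL context look like this (the table name is made up); it is the cached tables that get stored in the in-memory columnar format:

sqlContext.cacheTable('people')          # programmatic
sqlContext.sql('CACHE TABLE people')     # same thing via SQL
# ... run queries against the in-memory columnar cache ...
sqlContext.uncacheTable('people')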
Re: configuring max sum of cores and memory in cluster through command line
It's not possible to specify YARN RM parameters on the spark-submit command line. You have to specify all resources that are available on your cluster to YARN upfront. If you want to limit the amount of resources available to your Spark job, consider using YARN dynamic resource pools instead http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Manager-Managing-Clusters/cm5mc_resource_pools.html -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 4:20 PM, Alexander Waldin wrote: > Hi, > > I'd like to specify the total sum of cores / memory as command line > arguments with spark-submit. That is, I'd like to set > yarn.nodemanager.resource.memory-mb and the > yarn.nodemanager.resource.cpu-vcores parameters as described in this blog > <http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/> > post. > > when submitting through the command line, what is the correct way to do > it? Is it: > > --conf spark.yarn.nodemanager.resource.memory-mb=54g > --conf spark.yarn.nodemanager.resource.cpu-vcores=31 > > or > > --conf yarn.nodemanager.resource.memory-mb=54g > --conf yarn.nodemanager.resource.cpu-vcores=31 > > > or something else? I tried these, and I tried looking in the > ResourceManager UI to see if they were set, but couldn't find them. > > Thanks! > > Alexander >
Re: .NET on Apache Spark?
Scala used to run on .NET http://www.scala-lang.org/old/node/10299 -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 1:26 PM, pedro wrote: > You might try using .pipe() and installing your .NET program as a binary > across the cluster (or using addFile). Its not ideal to pipe things in/out > along with the overhead, but it would work. > > I don't know much about IronPython, but perhaps changing the default python > by changing your path might work? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/NET-on-Apache-Spark-tp23578p23594.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
Re: Problem after enabling Hadoop native libraries
You can run hadoop checknative -a and see if bzip2 is detected correctly. -- Ruslan Dautkhanov On Fri, Jun 26, 2015 at 10:18 AM, Marcelo Vanzin wrote: > What master are you using? If this is not a "local" master, you'll need to > set LD_LIBRARY_PATH on the executors also (using > spark.executor.extraLibraryPath). > > If you are using local, then I don't know what's going on. > > On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh > wrote: > >> Hi, >> I'm having trouble reading Bzip2 compressed sequence files after I >> enabled hadoop native libraries in spark. >> >> Running >> LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/ $SPARK_HOME/bin/spark-submit >> --class gives the following error >> >> 5/06/26 00:48:02 INFO CodecPool: Got brand-new decompressor [.bz2] >> 15/06/26 00:48:02 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID >> 3) >> java.lang.UnsupportedOperationException >> at >> org.apache.hadoop.io.compress.bzip2.BZip2DummyDecompressor.decompress(BZip2DummyDecompressor.java:32) >> at >> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91) >> at >> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) >> at java.io.DataInputStream.readFully(DataInputStream.java:195) >> at java.io.DataInputStream.readLong(DataInputStream.java:416) >> >> removing the LD_LIBRARY_PATH makes spark run fine but it gives the >> following warning >> WARN NativeCodeLoader: Unable to load native-hadoop library for your >> platform... using builtin-java classes where applicable >> >> Has anyone else run into this issue ? Any help is welcome. >> > > > > -- > Marcelo >
Re: flume sinks supported by spark streaming
https://spark.apache.org/docs/latest/streaming-flume-integration.html Yep, avro sink is the correct one. -- Ruslan Dautkhanov On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid wrote: > Hi! > > > I want to integrate flume with spark streaming. I want to know which sink > type of flume are supported by spark streaming? I saw one example using > avroSink. > > Thanks > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/flume-sinks-supported-by-spark-streaming-tp23462.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
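For completeness, the Flume side of the push-based approach is an avro sink pointed at the host/port the Spark receiver listens on (the agent, channel, host and port names below are placeholders):

agent.sinks.sparkSink.type = avro
agent.sinks.sparkSink.hostname = <receiver-host>
agent.sinks.sparkSink.port = <port>
agent.sinks.sparkSink.channel = memoryChannel

and on the Spark side, assuming a release where the Flume Python API is available (the Scala FlumeUtils call is analogous):

from pyspark.streaming.flume import FlumeUtils
flumeStream = FlumeUtils.createStream(ssc, "<receiver-host>", <port>)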
Re: [ERROR] Insufficient Space
Vadim, You could edit /etc/fstab, then issue mount -o remount to increase shared memory online. I didn't know Spark uses shared memory. Hope this helps. On Fri, Jun 19, 2015, 8:15 AM Vadim Bichutskiy wrote: > Hello Spark Experts, > > I've been running a standalone Spark cluster on EC2 for a few months now, > and today I get this error: > > "IOError: [Errno 28] No space left on device > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > OpenJDK 64-Bit Server VM warning: Insufficient space for shared memory > file" > > I guess I need to resize the cluster. What's the best way to do that? > > Thanks, > Vadim > >
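The "Insufficient space for shared memory file" warning usually points at a full tmpfs mount (commonly /dev/shm, or the filesystem holding the JVM's perf files under /tmp). A rough example of growing a tmpfs online, assuming it is /dev/shm and the instance can spare the memory (the size is illustrative):

# /etc/fstab entry
tmpfs   /dev/shm   tmpfs   defaults,size=2g   0 0

mount -o remount /dev/shm
df -h /dev/shm    # verify the new size

The "No space left on device" IOError itself may simply be a full disk used for shuffle/temp space, so checking df -h on the workers is worth doing first.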
Re: Does MLlib have attribute importance?
Got it. Thanks! -- Ruslan Dautkhanov On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng wrote: > ChiSqSelector calls an RDD of labeled points, where the label is the > target. See > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L120 > > On Wed, Jun 17, 2015 at 10:22 PM, Ruslan Dautkhanov > wrote: > > Thank you Xiangrui. > > > > Oracle's attribute importance mining function have a target variable. > > "Attribute importance is a supervised function that ranks attributes > > according to their significance in predicting a target." > > MLlib's ChiSqSelector does not have a target variable. > > > > > > > > > > -- > > Ruslan Dautkhanov > > > > On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng wrote: > >> > >> We don't have it in MLlib. The closest would be the ChiSqSelector, > >> which works for categorical data. -Xiangrui > >> > >> On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov < > dautkha...@gmail.com> > >> wrote: > >> > What would be closest equivalent in MLLib to Oracle Data Miner's > >> > Attribute > >> > Importance mining function? > >> > > >> > > >> > > http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 > >> > > >> > Attribute importance is a supervised function that ranks attributes > >> > according to their significance in predicting a target. > >> > > >> > > >> > Best regards, > >> > Ruslan Dautkhanov > > > > >
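For anyone landing on this thread later, a minimal PySpark sketch of what Xiangrui describes (it assumes a release whose Python MLlib exposes ChiSqSelector, and the features must be categorical/discrete):

from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# the label plays the role of the "target" attribute
data = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([0.0, 1.0, 2.0])),
    LabeledPoint(1.0, Vectors.dense([1.0, 1.0, 0.0])),
])
selector = ChiSqSelector(numTopFeatures=2)   # keep the 2 most predictive attributes
model = selector.fit(data)
reduced = model.transform(data.map(lambda lp: lp.features))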
Re: Does MLlib have attribute importance?
Thank you Xiangrui. Oracle's attribute importance mining function has a target variable. "Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target." MLlib's ChiSqSelector does not have a target variable. -- Ruslan Dautkhanov On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng wrote: > We don't have it in MLlib. The closest would be the ChiSqSelector, > which works for categorical data. -Xiangrui > > On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov > wrote: > > What would be closest equivalent in MLLib to Oracle Data Miner's > Attribute > > Importance mining function? > > > > > http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 > > > > Attribute importance is a supervised function that ranks attributes > > according to their significance in predicting a target. > > > > > > Best regards, > > Ruslan Dautkhanov >
Does MLlib have attribute importance?
What would be the closest equivalent in MLlib to Oracle Data Miner's Attribute Importance mining function? http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target. Best regards, Ruslan Dautkhanov
k-means for text mining in a streaming context
Hello, Would Feature Extraction and Transformation (https://spark.apache.org/docs/latest/mllib-feature-extraction.html) work in a streaming context? I want to extract text features and build k-means clusters in a streaming context to detect anomalies on a continuous text stream. Would that be possible? Best regards, Ruslan Dautkhanov
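A rough sketch of what I have in mind (it assumes a Spark release where StreamingKMeans is exposed to Python; in older releases the equivalent Scala API exists):

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import StreamingKMeans

tf = HashingTF(numFeatures=1000)
# lines is a DStream of raw text; hash each document into a fixed-size term-frequency vector
vectors = lines.map(lambda line: tf.transform(line.split(" ")))

model = StreamingKMeans(k=10, decayFactor=0.5).setRandomCenters(dim=1000, weight=1.0, seed=42)
model.trainOn(vectors)
assignments = model.predictOn(vectors)   # cluster ids; points far from their center could be flagged as anomalies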
Re: Spark Job always cause a node to reboot
vm.swappiness=0? Some vendors recommend setting this to 0 (zero), although I've seen that cause even the kernel to fail to allocate memory, and it may cause a node reboot. If that's the case, set vm.swappiness to 5-10 and decrease spark.*.memory: check whether spark.driver.memory + spark.executor.memory + OS overhead, etc. exceeds the amount of memory the node has. -- Ruslan Dautkhanov On Thu, Jun 4, 2015 at 8:59 AM, Chao Chen wrote: > Hi all, > I am new to spark. I am trying to deploy HDFS (hadoop-2.6.0) and > Spark-1.3.1 with four nodes, and each node has 8-cores and 8GB memory. > One is configured as headnode running masters, and 3 others are workers > > But when I try to run the Pagerank from HiBench, it always cause a node to > reboot during the middle of the work for all scala, java, and python > versions. But works fine > with the MapReduce version from the same benchmark. > > I also tried standalone deployment, got the same issue. > > My spark-defaults.conf > spark.masteryarn-client > spark.driver.memory 4g > spark.executor.memory 4g > spark.rdd.compress false > > > The job submit script is: > > bin/spark-submit --properties-file > HiBench/report/pagerank/spark/scala/conf/sparkbench/spark.conf --class > org.apache.spark.examples.SparkPageRank --master yarn-client > --num-executors 2 --executor-cores 4 --executor-memory 4G --driver-memory > 4G > HiBench/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar > hdfs://discfarm:9000/HiBench/Pagerank/Input/edges > hdfs://discfarm:9000/HiBench/Pagerank/Output 3 > > What is problem with my configuration ? and How can I find the cause ? > > any help is welcome ! > > > > > > > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
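To check and adjust it on a node (a quick sketch; the target value of 10 is a judgment call):

sysctl vm.swappiness              # show the current value
sudo sysctl -w vm.swappiness=10   # change it online
# persist across reboots by adding 'vm.swappiness = 10' to /etc/sysctl.conf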
Re: How to monitor Spark Streaming from Kafka?
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4 http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf -- Ruslan Dautkhanov On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg wrote: > Thank you, Tathagata, Cody, Otis. > > - Dmitry > > > On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic < > otis.gospodne...@gmail.com> wrote: > >> I think you can use SPM - http://sematext.com/spm - it will give you all >> Spark and all Kafka metrics, including offsets broken down by topic, etc. >> out of the box. I see more and more people using it to monitor various >> components in data processing pipelines, a la >> http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/ >> >> Otis >> >> On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg >> wrote: >> >>> Hi, >>> >>> What are some of the good/adopted approached to monitoring Spark >>> Streaming >>> from Kafka? I see that there are things like >>> http://quantifind.github.io/KafkaOffsetMonitor, for example. Do they >>> all >>> assume that Receiver-based streaming is used? >>> >>> Then "Note that one disadvantage of this approach (Receiverless Approach, >>> #2) is that it does not update offsets in Zookeeper, hence >>> Zookeeper-based >>> Kafka monitoring tools will not show progress. However, you can access >>> the >>> offsets processed by this approach in each batch and update Zookeeper >>> yourself". >>> >>> The code sample, however, seems sparse. What do you need to do here? - >>> directKafkaStream.foreachRDD( >>> new Function, Void>() { >>> @Override >>> public Void call(JavaPairRDD rdd) throws >>> IOException { >>> OffsetRange[] offsetRanges = >>> ((HasOffsetRanges)rdd).offsetRanges >>> // offsetRanges.length = # of Kafka partitions being >>> consumed >>> ... >>> return null; >>> } >>> } >>> ); >>> >>> and if these are updated, will KafkaOffsetMonitor work? >>> >>> Monitoring seems to center around the notion of a consumer group. But in >>> the receiverless approach, code on the Spark consumer side doesn't seem >>> to >>> expose a consumer group parameter. Where does it go? Can I/should I >>> just >>> pass in group.id as part of the kafkaParams HashMap? >>> >>> Thanks >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> >
Re: Value for SPARK_EXECUTOR_CORES
It's not only about cores. Keep in mind spark.executor.cores also affects the available memory for each task: From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ The memory available to each task is (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores. Memory fraction and safety fraction default to 0.2 and 0.8 respectively. I'd test spark.executor.cores with 2, 4, 8 and 16 and see what makes your job run faster. -- Ruslan Dautkhanov On Wed, May 27, 2015 at 6:46 PM, Mulugeta Mammo wrote: > My executor has the following spec (lscpu): > > CPU(s): 16 > Core(s) per socket: 4 > Socket(s): 2 > Thread(s) per code: 2 > > The CPU count is obviously 4*2*2 = 16. My question is what value is Spark > expecting in SPARK_EXECUTOR_CORES ? The CPU count (16) or total # of cores > (2 * 2 = 4) ? > > Thanks >
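To make that formula concrete (illustrative numbers only): with spark.executor.memory=4g and spark.executor.cores=4, each task gets roughly 4096 MB * 0.2 * 0.8 / 4 ≈ 164 MB of shuffle memory; the same executor with 16 cores leaves only about 41 MB per task.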
Re: PySpark Logs location
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. You can enable log aggregation by changing yarn.log-aggregation-enable to true so it'll be easier to see yarn application logs. -- Ruslan Dautkhanov On Thu, May 21, 2015 at 5:08 AM, Oleg Ruchovets wrote: > Doesn't work for me so far , >using command but got such output. What should I check to fix the > issue? Any configuration parameters ... > > > [root@sdo-hdp-bd-master1 ~]# yarn logs -applicationId > application_1426424283508_0048 > 15/05/21 13:25:09 INFO impl.TimelineClientImpl: Timeline service address: > http://hdp-bd-node1.development.c4i:8188/ws/v1/timeline/ > 15/05/21 13:25:09 INFO client.RMProxy: Connecting to ResourceManager at > hdp-bd-node1.development.c4i/12.23.45.253:8050 > /app-logs/root/logs/application_1426424283508_0048does not exist. > *Log aggregation has not completed or is not enabled.* > > Thanks > Oleg. > > On Wed, May 20, 2015 at 11:33 PM, Ruslan Dautkhanov > wrote: > >> Oleg, >> >> You can see applicationId in your Spark History Server. >> Go to http://historyserver:18088/ >> >> Also check >> https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application >> >> It should be no different with PySpark. >> >> >> -- >> Ruslan Dautkhanov >> >> On Wed, May 20, 2015 at 2:12 PM, Oleg Ruchovets >> wrote: >> >>> Hi Ruslan. >>> Could you add more details please. >>> Where do I get applicationId? In case I have a lot of log files would it >>> make sense to view it from single point. >>> How actually I can configure / manage log location of PySpark? >>> >>> Thanks >>> Oleg. >>> >>> On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov < >>> dautkha...@gmail.com> wrote: >>> >>>> You could use >>>> >>>> yarn logs -applicationId application_1383601692319_0008 >>>> >>>> >>>> >>>> -- >>>> Ruslan Dautkhanov >>>> >>>> On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets >>>> wrote: >>>> >>>>> Hi , >>>>> >>>>> I am executing PySpark job on yarn ( hortonworks distribution). >>>>> >>>>> Could someone pointing me where is the log locations? >>>>> >>>>> Thanks >>>>> Oleg. >>>>> >>>> >>>> >>> >> >
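For reference, turning aggregation on is a one-property change in yarn-site.xml (the YARN services then need a restart; property name as in the Hadoop docs):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

after which yarn logs -applicationId <application id> should work from any node once an application finishes.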
Re: PySpark Logs location
Oleg, You can see applicationId in your Spark History Server. Go to http://historyserver:18088/ Also check https://spark.apache.org/docs/1.1.0/running-on-yarn.html#debugging-your-application It should be no different with PySpark. -- Ruslan Dautkhanov On Wed, May 20, 2015 at 2:12 PM, Oleg Ruchovets wrote: > Hi Ruslan. > Could you add more details please. > Where do I get applicationId? In case I have a lot of log files would it > make sense to view it from single point. > How actually I can configure / manage log location of PySpark? > > Thanks > Oleg. > > On Wed, May 20, 2015 at 10:24 PM, Ruslan Dautkhanov > wrote: > >> You could use >> >> yarn logs -applicationId application_1383601692319_0008 >> >> >> >> -- >> Ruslan Dautkhanov >> >> On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets >> wrote: >> >>> Hi , >>> >>> I am executing PySpark job on yarn ( hortonworks distribution). >>> >>> Could someone pointing me where is the log locations? >>> >>> Thanks >>> Oleg. >>> >> >> >
Re: PySpark Logs location
You could use yarn logs -applicationId application_1383601692319_0008 -- Ruslan Dautkhanov On Wed, May 20, 2015 at 5:37 AM, Oleg Ruchovets wrote: > Hi , > > I am executing PySpark job on yarn ( hortonworks distribution). > > Could someone pointing me where is the log locations? > > Thanks > Oleg. >
Re: Reading Nested Fields in DataFrames
I had the same question on Stack Overflow recently: http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark Lomig Mégard gave a detailed answer on how to do this without using LATERAL VIEW. On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh wrote: > Hi , > I am trying to read Nested Avro data in Spark 1.3 using DataFrames. > I need help to retrieve the Inner element data in the Structure below. > > Below is the schema when I enter df.printSchema : > > |-- THROTTLING_PERCENTAGE: double (nullable = false) > |-- IMPRESSION_TYPE: string (nullable = false) > |-- campaignArray: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- COOKIE: string (nullable = false) > |||-- CAMPAIGN_ID: long (nullable = false) > > > How can I access CAMPAIGN_ID field in this schema ? > > Thanks, > Ashish Kr. Singh >
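A rough illustration of getting at the nested field (it assumes Spark 1.4+ for pyspark.sql.functions.explode; on 1.3 the HiveQL equivalent is SELECT c.CAMPAIGN_ID FROM tbl LATERAL VIEW explode(campaignArray) t AS c):

from pyspark.sql.functions import explode

campaigns = df.select(explode(df.campaignArray).alias("c"))   # one row per element of the array
campaigns.select("c.CAMPAIGN_ID").show()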